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Resume 

L'imagerie par resonance magnetique fonctionnelle (IRMf) permet d'acquerir des images 
tridimensionnelles de I'activite cerebrale d'un sujet soumis a une sequence de stimula- 
tions sensorielles. L'analyse statistique des ces donnees permet de detecter les aires 
cerebrales actives en reponse aux differentes stimulations. 

Lorsque plusieurs sujets ont ete recrutes pour une experience, l'analyse de groupe con- 
siste a generaliser les resultats individuels a la population d'interet dont sont issus les 
sujets. La variabilite morphologique du cerveau liumain rend cependant la comparaison 
des images acquises sur les differents sujets problematique 

L'approche usuelle pour contrer cette difficulte consiste a recaler les sujets dans un 
referentiel commun, puis de comparer les cerveau separement en chaque point de ce 
referentiel. Cette etape de recalage n'etant jamais parfaite, il en resulte une incertitude 
sur la localisation spatiale de chaque sujet. 

Nous proposons dans un premier temps d'etendre le modele classique d'analyse de groupe 
afin de prendre en compte cette incertitude spatiale. Dans un deuxieme temps, nous 
developpons a partir de ce modele une nouvelle approche de detection d'aires cerebrales 
actives, basee sur des regions d'interet predefinies plutot que sur les procedures de seuil- 
lage couramment utilisees. 
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This thesis is dedicated to the statistical analysis of multi-subject fMRI data, with 
the purpose of identifying bain structures involved in certain cognitive or sensori-motor 
tasks, in a reproducible way across subjects. To overcome certain limitations of standard 
voxel-based testing methods, as implemented in the Statistical Parametric Mapping 
(SPM) software, we introduce a Bayesian model selection approach to this problem, 
meaning that the most probable model of cerebral activity given the data is selected 
from a pre-defined collection of possible models. 

Based on a parcellation of the brain volume into functionally homogeneous regions, 
each model corresponds to a partition of the regions into those involved in the task un- 
der study and those inactive. This allows to incorporate prior information, and avoids 
the dependence of the SPM-like approach on an arbitrary threshold, called the cluster- 
forming threshold, to define active regions. By controlling a Bayesian risk, our approach 
balances false positive and false negative risk control. Furthermore, it is based on a gen- 
erative model that accounts for the spatial uncertainty on the localization of individual 
effects, due to spatial normalization errors. 

On both simulated and real fMRI datasets, we show that this new paradigm corrects 
several biases of the SPM-like approach, which either swells or misses the different active 
regions, depending on the choice of a cluster-forming threshold. 
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Chapter 1 



Introduction 



1.1 Context and objectives 

This thesis is dedicated to the analysis of multi-subject £MRI data. Functional magnetic 
resonance imaging (fMRI) is a modality of in vivo brain imaging that allows to measure 
the variations of cerebral blood oxygen levels induced by the neural activity of a subject 
lying inside a MRI scanner and submitted to a series of stimuli (see Figure 1.1). One 
of the main goals of fMRI group studies, through the analysis of the data acquired 
on a cohort of subjects, is to identify brain structures involved in certain cognitive or 
sensori-motor tasks, in a reproducible way across subjects. 

Many sources of variability are present in fMRI datasets, making statistical methods 
necessary to perform such analyses. These include, but are not limited to, the movements 
of the subject inside the machine, his or her level of attention during the experiment, 
various physiological signals, such as heartbeats and breathings, etc. Furthermore, the 
BOLD response itself varies across the different runs of a single experiment, due to 
habituation and attentional effects. For all these reasons, the measures acquired on 
each subject are uncertain, and variable should the experiment be reproduced with the 
same subject. Additionally, each subject reacts differently when exposed to the same 
stimuli, so a certain variability in the brain activation pattern is also expected across 
the subjects. 

Accounting for these two types of variability, classically referred to as within-suhiect 
and between-subject, is a well-known problem in the literature on statistical inference 
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Figure 1.1: A typical fMRI experiment. The subject, lying in the MRI scanner (upper 

left) is submitted to a series of stimuli (upper right), visual, auditory or other, while 

a series of 3D images of his/her brain activity is acquired (lower right). A structural 

image of the subject's brain (lower left) is also acquired, using the same scanner. 



for biological data [Mcculloch and Searle, 2001]. Another challenge, specific to medical 
imaging, and in particular to neuroimaging, is the important variability of the human 
brain anatomy, which makes the comparison of brain images from different subjects far 
from straightforward. To address this, individual images are most often normalized to 
a common brain template using nonrigid registration. However, registration is prone to 
errors (even assuming the existence of point-to-point correspondences between different 
brains) , hence it does not seem reasonable to assume that homologous points are aligned 
across subjects. 

As a consequence of all these fluctuations, the activation pattern inferred from any given 
fMRI dataset can only be determined with a limited amount of confidence. Ideally, the 
statistical procedure used to analyse the data should therefore account for the different 
sources of variability, and assess the uncertainty they induce on the estimated activation 
pattern. 

The purpose of this work is to propose such a statistical analysis framework. This is 
still an open problem, though the development of appropriate methods for fMRI group 
inference has been an active field of research for at least two decades. The most widely 
used approach to address these questions is the co-called mass- univariate, or voxel-based 
detection, popularized by [Friston, 1997]. It starts with normalizing individual images 
onto a common brain template, as mentioned above. Next, a t-statistic is computed 
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in each voxel to locally assess mean group effects. The candidate regions are then 
defined as the connected components (clusters) of the resulting statistical map above an 
arbitrary threshold called the cluster-forming threshold. Only the clusters whose sizes 
exceed a critical value are reported. This critical size acts as a second threshold, and 
is generally tuned to control the probability of one or more clusters being detected by 
chance. For scientific reporting purposes, the detected clusters may finally be related 
to known anatomical regions based on expert knowledge, or using a digital brain atlas 
such as the Automated Atlas Label (AAL) [Tzourio-Mazoyer et al., 2002]. 

Though simple and widely applicable, this approach suffers from several shortcomings. 
One of these is the dependence on an arbitrary cluster-forming threshold. High values of 
this threshold may result in missing active regions, and low values in merging function- 
ally distinct regions, yielding poor localization power, due to the fact that active voxels 
cannot be localized within each detected cluster [Hayasaka and Nichols, 2003]. Fur- 
thermore, the absence of activations outside the detected clusters cannot be assessed. 
Consequently, there is no guarantee that the whole functional network can be recovered. 
Finally, due to unavoidable intersubject registration errors, the observed activations are 
not well-localized, and possibly displaced across distinct functional regions, which may 
result in blurring the group activation map, and create non-handled false positives. 

To date, these limitations of the mass-univariate approach have been tackled separately. 
This thesis introduces a new approach for fMRI group data analysis that addresses them 
jointly. 

1.2 Organization and main contributions 

The remaining of this thesis is organized in six chapters, which we now summarize: 

• In Chapter 2, we give a review of current approaches for fMRI group data analysis, 
with a focus on mass-univariate detection. This was popularized by the Statistical 
Parametric Mapping (SPM) software package, and we refer to it as SPM-like in 
the following. Our goal here is not to give a complete list of existing methods, but 
rather summarize the main directions of research that have been followed up to 
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now. This chapter also contains several contributions we have made to the SPM- 
like approach. [Keller et al., 2007, Keller and Roche, 2008] present an implemen- 
tation of the Gaussian two-level group model using the maximum likelihood ratio 
test, a nonparametric generalization of which is developed in [Roche et al., 2007]. 
We have also participated to another paper [Thirion et al., 2006b] describing a 
high-level group analysis approach which models the spatial variability of the ac- 
tivation patterns, and which we have included in this review. 

• In Chapter 3, we propose to overcome the dependence of the SPM-like approach on 
an arbitrary detection threshold, and its exclusive control over false positives, by 
using the random threshold approach developed in [Lavielle and Ludeha, 2007]. 
One of the key ideas of this chapter, which we re-exploit in the following, is to 
re-interpret the task of detecting activations, traditionally seen from a multiple 
testing point of view, as a model selection problem. This work has been submitted 
to the Canadian Journal of Statistics. 

• Chapter 4 proposes a detection method that relaxes the assumption of perfect 
match. To do this, we extend the classical mass-univariate model by incorporating 
a set of latent variables representing registration errors, and model them as random 
deformation fields. Our main results consist in demonstrating the stretching effect 
of the group estimated activation pattern due to neglecting spatial uncertainty, and 
its compensation through our approach. A first version of this work was published 
in [Keller et al., 2008]. 

• Chapter 5 introduces a new paradigm for fMRI group data analysis that ad- 
dresses jointly some key limitations of the SPM-like approach, and was published 
in [Keller et al., 2009]. Based on a Bayesian model selection framework, regions in- 
volved in the task under study are selected according to the posterior probabilities 
of a nonzero mean activation, given a pre-defined parcellation of the search vol- 
ume into functionally homogeneous regions. Thus our approach is threshold-free, 
while allowing to incorporate prior information, provided that the parcellation is 
sensible. By controlling a Bayesian risk, our approach balances false positive and 
false negative risks, with weights than can be tuned depending on the applica- 
tion domain. Importantly, it is based on the same spatial uncertainty model as in 
Chapter 4, and thus accounts for the mis-alignment of individual images, due to 
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inevitable registration errors. Hence for each subject, the membership of a voxel 
to a given region is probabilistic rather than deterministic. 

Results on simulated data show that neglecting spatial uncertainty may result in 
a bias toward false positives when selecting active regions. This is a consequence 
of the stretching effect evidenced in Chapter 4. We also show that this bias can 
be compensated when modeling spatial uncertainty. 

In Chapter 6 we validate our model selection approach on a real fMRI dataset, 
by successfully recovering the whole network of regions involved in basic number 
and language processing, in accordance with previous works. The purpose of this 
chapter is also to illustrate the limitations of the SPM-like approach, which we 
apply to the same dataset, and the benefits of modeling spatial uncertainty. 

We conclude in Chapter 7 by a summary of the main results and discuss perspec- 
tives for future work. 

Finally, all the methods developed throughout this work where implemented in 
Python, and are now part of the NIPY open-source neuroimaging analysis package 
http://neuroimaging.scipy.org/site/index.html. Methods concerning the SPM-like 
approach have also been integrated into the fMRI toolbox of the BrainVisa soft- 
ware, developed at CEA/Neurospin http://www.brainvisa.info/index.html. 



Chapter 2 

fMRI group data analysis: the 
'SPM-like' approach 



2.1 Introduction 

The main goal of this chapter is to give a detailed account of the current approaches 
to fMRI group data analysis, with a focus on the approach primarily developed in 
[Priston, 1997] and popularized by the Statistical Parametric Mapping (SPM) software, 
which we will hence refer to in the following as the 'SPM- like' approach. It is the most 
widely used method to date, and has served both as the basis for our research, and the 
reference to validate our methods. 

This approach comprises two main steps. First, each subject's data is processed sepa- 
rately, resulting in a series of summary statistics, such as a statistical map of the subject's 
brain activity in response to any given contrast of experimental conditions. This step is 
described in Section 2.2, along with some basic facts on the nature of fMRI data. 

Next, the subjects' summary statistics are used as input data for group level analysis. 
In Section 2.3 we describe how this is accomplished in the 'SPM-like' approach, and 
discuss its advantages and limitations. 

Many alternatives have been proposed to overcome the limitations of the 'SPM-like' 
approach. We give an overview of the current literature on this subject in Section 2.6, 



fMRI Group Data Analysis: the 'SPM-like' Approach 



Image time-series Kernel 



Design matrix Statistical parametric map (SPM) 




^0 ^0 so 

Parameter estimates 




Gaussian 
field theory 



p <0.05 



Figure 2.1: Pipeline of single-subject data analysis using the SPM-like approach, 
from [Friston et al., 1995]. The data is realigned, normalized to a brain template rep- 
resenting a standard space, and smoothed. These pre-processings are described in 
Sections 2.2.1 and 2.3.1. Next, the time-series are modeled separately in each voxel us- 
ing the general linear model, whose parameters represent the amplitude of the BOLD 
response evoked by each type of stimuli (see Section 2.2.3). Active brain areas are 
detected by thresholding a map of test statistics, as explained in Section 2.2.4. 



then in Section 2.7 discuss the directions we have decided to follow in our work, and 
how these relate to the existing approaches. 



2.2 Single-Subject Data Analysis 



2.2.1 The blood oxygen level dependent effect 



Since the early nineties, functional magnetic resonance imaging (fMRI) has become one 
of the principal tools to investigate the functional organization of the human brain. 
This popularity is due to both the relatively good spatial resolution and low invasive- 
ness of fMRI compared to other functional imaging modalities. For instance, positron- 
emission tomography (PET) involves exposure to ionizing radiation; electroencephalog- 
raphy (EEG) or magnetoencephalography (MEG) have significantly less spatial reso- 
lution, and involve an inverse-problem for source reconstruction which has no unique 
solution. 
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Using fMRI, cerebral activity is measured indirectly, through its impact on the vascular 
network. More precisely, neural activity induces a local increase in blood oxygenation, 
known as the blood oxygen level dependent (BOLD) effect [Ogawa et al., 1990] (see 
Figure 2.2). Because of the magnetic properties of the oxyhemoglobin molecule (oxyHb), 
this effect can be measured by the scanner. One may note that this inherently limits the 
precision of this technique in locating activations, since the surge in blood oxygenation 
does not necessarily take place at the exact place of the neural activity. The link between 
these two phenomenons is still an active research area. 




HEM0GL08IIM 
OXYGEN ,, 




Figure 2.2: Illustration of the BOLD effect, from [Raichle, 1994]. (left) at rest (right) 
during an activation. Each neural activation induces a local increase in blood flow, that 
is substantially higher than the increase in oxygen consumption of the nearby neurons. 
This results in a fall in the concentration in deoxyhemoglobin (in orange), and a surge 

in the fMRI signal. 



2.2.2 Data acquisition and preprocessings 



Most fMRI paradigms are designed to identify brain areas involved in certain cognitive or 
sensori- motor tasks. They consist in a series of stimuli (visual, auditory, etc.) presented 
to the subject lying inside the scanner. The primary motive of this thesis lies in the 
analysis of data arising from such experiments. An alternative which has received much 
attention lately consists in scanning the subject in absence of external stimuli. The aim 
of studying such 'resting state' data is to identify latent attentional networks, e.g. using 
multivariate exploratory techniques [Beckmann and Smith, 2004, Perlbarg et al., 2008]. 

In all cases, the data resulting from a fMRI experiment consists in a sequence of three- 
dimensional (3D) brain images, with a typical spatial resolution of 3mm^ on a standard 
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3 Tesla (3T) scanner, acquired at the rate of one volume every 2 to 3 seconds. An 
anatomical image is also acquired using the same scanner, at a higher spatial resolution 
of about 1mm? planar resolution (inter-slice resolution may be lower). 

Before undergoing group analysis, this data is submitted to a number of pre-processing 
steps. These are briefly summarized hereafter (see [Friston, 1997] for further details). 
Pre-processing usually starts with the temporal realignment, also termed slice-timing^ 
of the successive scans to ensure that all time series inside the volume are sampled at 
the same time-points, even though the different slices are acquired at slightly different 
moments. Next, the scans are realigned in order to compensate for subject's motion 
inside the scanner. 

After temporal and spatial realignment, the data is rigidly registered to the associ- 
ated anatomical image of the same subject, and warped using a nonrigid transformation 
resulting from the registration of the anatomical image to a brain template, which repre- 
sents a standard stereotactic space such as the Talairach space [Talairach and Tournoux, 1988]. 
This so-called spatial normalization step has a key impact on group data analysis, as 
discussed in Section 2.3.1. Note that the different spatial transformations resulting from 
motion correction, functional/anatomical registration, and brain warping, are usually 
composed in order to prevent the data from being resampled more than once after slice 
timing correction. 

Finally, the data is spatially smoothed, in order to enhance the signal-to-noise (SNR) 
ratio, most often using a Gaussian kernel. The spread of this kernel, measured by its full- 
width at half maximum (FWHM), usually varies between 5 and 8 mm [Friston, 1997]. 

2.2.3 GLM analysis of time series 

The pre-processed data of a single subject is most often analysed separately at each voxel 
using regression techniques. This approach, often termed massively univariate because 
voxels are processed independently from one another, has been developed primarily in 
[Friston, 1997, Worsley et al., 2002], and is available in many software packages, such as 
Statistical Parametric Mapping (SPM) or the FMRIB Software Library (FSL). 
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Figure 2.3: Glover haemodynamic response function (HRF). 
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Figure 2.4: Example of a IMRI time-series observed in a certain voxel. Occurrence 
times (onsets) of the successive stimuli appear in red (time is measured in seconds). 



BOLD response modeling 

Let us denote Y^ = (Y^ i, . . . , Yk,T), the time-series observed at voxel A;, for /c = 1, . . . ,d, 
where d denotes the number of voxels within the search volume. This volume may be 
defined as the intersection of individual brain regions extracted from the functional 
images, or a unique region delineated in the template space. 

The time-series, illustrated in Figure 2.4, reflects the BOLD response at voxel k. The 
response to a series of stimulations of the same kind is traditionally modeled as a sta- 
tionary linear filter with finite impulse response called the HRF. This leads to a general 
linear model (GLM): 



X/3fc + £fc. 



(2.1) 
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Here X is the T x m design matrix containing the m regressors (one per column), and 
(3j^ the corresponding parameters of interest. Regressors include the constant vector ly, 
modeling a baseline activation, along with vectors modeling the BOLD responses to the 
different stimulations, also termed neural response levels (NRLs). These are defined for 
each given stimulus type as the (discretized) convolution of the HRF ^ by a function 
S representing the external stimulation. For event-related fMRI designs, S is a sum of 
Dirac masses specifying the occurrence times, or onsets of the successive stimuli, whereas 
for block-related fMRI designs, S is typically a sum of boxcar functions, specifying the 
beginning and the end of each stimulation block. 

Additional regressors may be added to model confounds, such as a low frequency drift 
due to physiological signals and scanner- related artefacts. This drift can be decomposed 
on a family of functions such as periodical (e.g. sine/cosine) functions or polynomials 
[Friston, 1997]. 

Noise modeling 

In (2.1), Ek S M'^ models the estimation errors, and is usually defined as a zero- mean 
Gaussian random variable. If its covariance matrix is spherical, i.e. if the estimation 
errors are assumed to be i. i. d. , with common variance o"^ , then the least square estimate 
and the maximum likelihood estimate of /3^ coincide. This common estimator is the best 
linear unbiased estimate (BLUE) of the model parameters, and is found by solving the 
linear system X'Yfc = X'X/3,^.. The variance af. is estimated in this case by the residual 
mean squares ^r^^ || Y^ — X/3j^, |p, where r is the rank of the design matrix X. 

However, fMRI time-series are known to be temporally correlated [Boynton et al., 1996, 
Zarahn et al., 1997]. Noting V^ the covariance matrix of the noise process St., the BLUE 
of the model parameters is now obtained by solving the following linear system: 

XX'Yfc = X'V^^X/3fc, (2.2) 

in the case where V^ is known. In general though, it is unknown and must also be 
estimated. As noted in [Donnet et al., 2006], this is an ill-posed problem since there are 
more parameters to be determined than available observations, so constraints must be 
imposed on Vfc. 
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Several approaches have been adopted. [Friston, 1997, Worsley et al., 2002] model e^ as 
first-order auto-regressive (AR(1)) process, so that at each time-point t = 1, . . . ,T: 

£k,t = Pk£k,{t-i) + ^k,t, (2.3) 

where p^ is a spatially varying autocorrelation parameter, and ^^ is a Gaussian white 
noise, with a spatially varying variance a'^. In [Friston, 1997], a simpler version of this 
model is used, where pk is assumed uniform across the search volume. In both cases, the 
variance parameters are estimated by restricted maximum likelihood (ReML), and the 
estimated covariance matrix V^ is substituted to the unknown true one in (2.2). The 
precision of the resulting 'plug-in' estimate of model parameters depends crucially on 
the accuracy of the covariance estimate V^. The study in [Friston and Buechel, 2000] 
suggests that the use of AR(1) with a global auto-correlation parameter may result 
in biased parameter estimates. Temporal smoothing is then suggested to improve the 
estimation. 

More sophisticated techniques to estimate V^ are compared in [Woolrich et al., 2001], 
including a Tukey taper combined with nonlinear spatial smoothing, which performs 
optimally among the tested strategies. Also, distinct auto-correlation parameters are 
estimated in each voxel, an option which is not considered in [Friston and Buechel, 2000], 
and may in part explain the observed bias. 

An alternative to the above plug-in estimation is to estimate the covariance and re- 
gression parameters jointly, by maximizing the full likelihood of the model. This is the 
approach developed in [Roche et al., 2004], where an AR(1) model is fit separately to 
the time-series acquired in each voxel. This is done iteratively, using an extension of the 
Kalman filter. This strategy further enables an 'online', or real-time, estimation of the 
model, the parameter estimates being updated as new scans are acquired. 

Generalization to higher- level auto-regressive models (AR(|?), with p > 1 ) is considered 
in [Penny et al., 2003], where a variational Bayes (VB) approximation is used to jointly 
estimate all model parameters. A very interesting feature of this approach is that it 
allows to select the order p of the auto-regressive model, separately in each voxel. This 
is done by maximizing the free energy J-{p), that is, the lower bound on the marginal 
likelihood central to the VB approach, easily available from the output of the VB algo- 
rithm. On an application to a particular real fMRI dataset, histograms of the optimal 
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values for the AR model order p show that it never exceeds p = 3.1n fact, in most voxels 
it is equal to either or 1. This provides a rough justification for the models described 
above, which all assume p = 1. 

Limits and alternatives 

We conclude this review on single-subject data modeling by noting that the formulation 
in (2.1) relies on the following key assumptions: 

• Fixed HRF. Several canonical shapes have been proposed for the HRF, such 
as the Glover HRF [Glover, 1999], used in SPM, defined as the difference of two 
Gamma functions (see Figure 2.3). This simplifying assumption may lead to a 
poor model fit in regions where the true HRF is significantly different from the 
one used in the estimation. 

• Linearity. The effects of the different stimuli are assumed to contribute additively 
to the overall effect evoked by the experimental paradigm, thus justifying the use 
of a linear model. This assumption may hold for an event-related paradigm if the 
inter-stimulus gap is long enough [Boynton et al., 1996, Glover, 1999]. Otherwise, 
non-linear effects may occur. 

• Time-Invariance. For a given stimulus type, the amplitude of the BOLD re- 
sponse is assumed constant across the successive stimulations, justifying the use of 
a single regressor for each stimulus type. This neglects so-called habituation effects, 
such as repetition-suppression, i.e. a decrease in the BOLD response observed 
when the same stimulus is presented repeatedly. Modeling such phenomenons 
using the standard GLM requires to incorporate additional regressors, hence re- 
ducing the degrees of freedom. Thus parametric modulation may be preferable, as 
discussed below. 

The modeling and analysis of brain haemodynamics is still a field of active research. 
To cite only a few examples, [Friston et al., 2000] suggested modeling non linearities in 
the BOLD response through Volterra series while remaining in the GLM framework, by 
adding additional regressors, defined as temporal derivatives of the canonical HRF. To 
overcome the limitations inherent to the use of a canonical HRF, [Penny et al., 2003, 
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Woolrich et al., 2004b] modeled the HRF as a linear combination of a pre-defined set 
of functions. In [Makni et al., 2008], a joint detection-estimation (JDE) framework is 
developed to estimate at the same time the shape h of the HRF, in a nonparametric way, 
and the NRLs f3, in a Bayesian setting. In [Donnet, 2006] an alternative approach to the 
statistical modeling of fMRI time series is investigated, based on the physiological model 
of the blood flow / oxygen consumption coupling proposed in [Buxton and Frank, 1997]. 

Another limitation, inherent to the mass univariate model, is that it ignores the spatial 
structure of the data. Several authors, including [Gossl et al., 2001, Penny et al., 2007, 
Smith and Fahrmeir, 2007, Makni et al., 2008], have proposed to model the NRLs, ac- 
cording to a spatial mixture model, with three classes, corresponding to null, activated 
and deactivated (inhibited) voxels. The labels identifying the state of each voxel are 
further modeled jointly as a hidden Markov random field, which tends to group active 
voxels in clusters. This has a regularizing effect, since isolated voxels, assimilated to 
noise, are less likely to be detected. Such spatial mixture models have also been pro- 
posed to model directly the statistical maps resulting from the GLM analysis, as a means 
to compute a detection threshold (see Section 2.6.1 for a discussion of these methods). 

2.2.4 Statistical map of brain activity 

Based on the model (2.1), an effect of interest, such as a difference between stimuli, 
can be specified by c(3j^, where c is a row vector of m contrasts [Worsley et al., 2002]. 
Detecting a nonzero effect is then equivalent to testing the null hypothesis that the effect 
is zero, Tio^k '■ c/3^ = 0. Assuming the data has been pre-whitened as described in the 
previous section, so that its covariance structure is approximately spherical, it follows 
from the Neyman-Pearson lemma that the optimal statistic for testing ^o,fe is Student's 
i-statistic: 



n = — ; ""^^ (2.4) 



<7fcA/c(XfcXfe)-lc' 



_ 1 
where X^ is the pre-whitened design matrix given by V^ ^X^, given the covariance 

matrix V^o"? of the data Y^. If V^ is known, then under "Hok, Tf^ follows a Student 
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Figure 2.5: Individual <-score map for the 'audio-video' contrast, from the Locahzer 
dataset [Pinel et al., 2007], with the the subject's anatomical image in the background. 
The map is thresholded at 10"'^, uncorrected, meaning that the presence of an ac- 
tivation is tested independently in each voxel at level 10~^, without accounting for 
multiple comparisons. Activations are clearly seen in the bilateral temporal regions; 
many isolated voxels are also detected, which may be false positives. 



distribution with T — (m + l) degrees of freedom (df). Consequently, 'Ho,k is rejected at 
level a ii Tk > ta, where ta is the (1 — a)-th quantile of the Student distribution with 
T — (m + 1) df. In practice V^, is estimated from the data, as explained in Section 2.2.3, 
so the test level is not exactly a. 

Computing Tk for all voxels k = 1, . . . ,d results in a statistical map, also termed activa- 
tion map, reflecting the subject's activation pattern for the given contrast c. Detecting 
activated regions in this setting is equivalent to testing simultaneously all the null hy- 
potheses ^o,fc) the number of which is the number of voxels in the search volume and is 
typically of the order of 100 000. 

To do so, one must address the multiple comparison problem. For instance, the naive 
procedure that consists in rejecting 'Ho,fc for all voxels k such that Tk > ta, as described 
above, choosing for instance a = 0.001, ensures that the probability of a false detection 
in each voxel is at most 0.001. However, this means that, in the worst-case scenario 
where all voxels are in fact inactive, an average of a x 100 000 = 100 false positives are 
detected throughout the volume. This issue is illustrated in Figure 2.5. 

The same multiple testing problem arises when detecting group activation pattern using 
the 'SPM-like' approach, as we will see in Section 2.3.3. Thus, we postpone this issue 



fMRI Group Data Analysis: the 'SPM-like' Approach 17 




Figure 2.6: coronal and axial slices of the MNI template, from [Flandin, 2004]. This 

is obtained as the average of 152 individual anatomical images, registered to an earlier 

version of the template (MIN305) using affine transformations. 



to Section 2.4, where we adress it jointly for individual and group data analysis. 

2.3 Group analysis: The mass univariate approach 

In a typical fMRI study, several subjects are recruited from a population of interest 
and scanned while submitted to the same series of stimuli. Activation maps associated 
with a given contrast are obtained for each subject, as described in the previous section, 
and used as input data for inference at the between-subject level, where the goal is to 
evidence a general brain activity pattern. 

In the following, we describe how this is performed in the 'SPM-like' approach, starting 
with the spatial normalisation step in Section 2.3.1, which aims to match each individual 
image to a brain template. The activation maps are then compared on a voxelwise basis, 
as described in Section 2.3.2, resulting in the computation of a map of statistics, detailed 
in Section 2.3.3, to test in each voxel the presence of a positive mean activation across 
subjects. 

2.3.1 Spatial normalisation and smoothing 

The high morphological variability of the human brain [Brett et al., 2002], makes the 
comparison of cerebral images across subjects problematic. 
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Figure 2.7: Illustration of spatial normalisation on a sagittal slice: (top) im- 
ages, before normalisation; (bottom) after normalisation, using affine transformations. 
Anatomical differences across subjects are seen to persist, especially in the sulco-gyral 

geometry. 



A traditional way to compensate for this variability, as mentioned in Section 2.2.1, 
is to register, or normalize, the individual images of all subjects to a common brain 
template [Ashburner and Friston, 1999], such as the widely used Montreal Neurological 
Institute (MNI) template (see Figure 2.6). This is usually done by minimizing a measure 
of discrepancy between image intensities over a suitable class of spatial transformations 
(see Chapter 4 for a brief review on registration methods). Comparative studies of 
several normalization methods can be found in [Hellier et al., 2003, Klein et al., 2009]. 

Any location in the brain can then be marked in a standard coordinate system, such as 
the one developed by [Talairach and Tournoux, 1988]. However, registration is prone to 
errors (even assuming the existence of point-to-point correspondences between different 
brains) , hence it does not seem reasonable to assume that homologous points are exactly 
aligned across subjects. 

This fact is often used as a motivation to justify a preliminary linear spatial smoothing 
of the data, with a typical FWHM of 8 to 12 mm, as a way to increase the overlap of 
functionally homologous regions over subjects. This smoothing is sometimes applied 
to the individual activation maps, resulting from the individual data processing, in 
addition to the first smoothing step used on the raw fMRI time-series to enhance SNR 
(see section 2.2.1). 

The 'SPM-like' approach then compares the individual images on a voxelwise basis, thus 
making an implicit assumption that each subject is in perfect match with the template. 
Among the consequences of that assumption, one may anticipate a stretching effect on 
group activity patterns due to the "jitter" induced by inaccurate registration. This effect 



fMRI Group Data Analysis: the 'SPM-like' Approach 



19 



l&: /S^ 





1 2 3 4 5 6 7 B 9 10 11 12 13 14 

Subject's number 

Figure 2.8: Example of fMRI group data in one voxel, during a language processing 
task, from [Mcriaux ct al., 2006]. Left, activations detected using the one-sided mixed- 
effect statistic, thresholded using a permutation test (see Section 2.3.3), with cross-hair 
at (60; 15; 6) Talairach coordinates in mm. Right, plot of the estimated effects in the 
same voxel and associated 70% confidence intervals 



can only be reinforced by the above-mentioned preliminary smoothing step. Evidence 
for this stretching is provided in [Keller et al., 2008], and will be one of the important 
results of Chapter 4. 

2.3.2 Between-subject modeling 

Let Xj = (xj^i, . . . , Xi,d) denote the map of BOLD effects in response to a certain contrast 
of experimental conditions, for subject i = 1 . . . , n. As seen in Section 2.2.4, a noisy 
estimate of Xj, y^ = (yj^i, . . . ,yj,d) is available from the analysis of the subject's scans, 
along with an image of estimation variances Sj = (sf i, . . . , s^^). 

More precisely, for each subject i, following the notations introduced in Section 2.2.4, 
we define in each voxel k : 



i,fc := c/3fc; Vi^k ■■= cPk, Sj_fc := (7feyc(XfcXfe) ^c' 



for a certain row contrast vector c. 



Under sufficient degrees of freedom at the within-subject level, it is reasonable to consider 
yi k as being normally distributed around xt k with standard deviate Si k considered fixed 
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[Worsley et al., 2002]. To address questions regarding the variability of the effect in a 
population, the unobserved effects xi^k, ■ ■ ■ ,Xn,k are further modeled as independent 
random variables drawn from an unknown distribution which characterizes the across- 
subject variability of BOLD responses. When this distribution is assumed Gaussian 
with unknown mean and variance (/Ufc, o"^), we obtain the same hierarchical model as in 
[Worsley et al., 2002, Beckmann et al., 2003a, Meriaux et al., 2006]: 

• First level (within-subject): 

yi,k = Xi^k + ^Lk] Si^k'"^' ^fiO,slk), (2.5) 

• Second level (between-subject): 

Xi,k = fJ'k + Vi,k; rji^k"^' -^{0,(^1)^ (2-6) 

where the independence sampling assumptions at both levels imply that in each voxel k, 
the pairs (xi^fc, yi,fc), . . . , {xn,k, yn,k) are mutually independent conditionally on the pop- 
ulation parameters (/i^, (t|). By integrating out the hidden variables Xj fc, we see that the 
observed effects are drawn independently but, in general, non-identically from Gaussian 
distributions: 



yi,k = Xi^k + ii,k'i Ci,fc *~' AA(0,s-_fc + crfc). (2.7) 

That is to say, the observations are generally heteroscedastic unless all first-level devi- 
ations Sj^fc are equal. In this special case, the model boils down to the simple sampling 
model in [Friston et al., 1995], that is computationally attractive but may lack robust- 
ness against noisy observations . 

We conclude this description of the standard model for fMRI group data with the fol- 
lowing remarks: 

• As for individual subject data (see Section 2.2.3), the group data is modeled sep- 
arately in each voxel, in a 'massively univariate' fashion. Consequently, possible 
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correlations between neighboring voxels are ignored at this stage. They will be 
accounted for in the multiple comparison step, discussed in Section 2.4. 

• The within-subject model (2.5) is also referred to as a fixed-effect (FFX) model, 
since it specifies the unobserved effects xi^k, ■ ■ ■ , Xn,k as the (fixed) parameters of in- 
terest. Inference in this model is therefore limited to the cohort of subject scanned 
during the experience. Likewise, the between-subject model (2.6) is referred to as 
a random-effect (RFX) model, since it considers the effects xi^k, ■ ■ ■ ,Xn,k as ran- 
dom variables. Finally, the hierarchical model specified by (2.5) and (2.6) is also 
called a mixed-effect (MFX), or mixed, model, since it 'mixes' the FFX and RFX 
models. 

• As noted in [Worsley et al., 2002], besides simply inferring on the mean population 
effect ;Ufc, we may also wish to compare two or more populations, and more generally 
regress the subjects' effects Xi^k on a certain set of regressors Zi^k- To this end, the 
mean effect fj.^ in (2.6) may be replaced by a linear term z'- ^ /x^ in the regressor 
variables. Though in the following we will focus on the one-sample setting for the 
sake of simplicity, the methods exposed here are fully adaptable to the general 
regression setting, except when mentioned otherwise. 

Several extensions to the standard hierarchical model specified by (2.5) and (2.6) have 
been proposed. In [Woolrich et al., 2004a] for instance, the uncertainty on first-level 
standard deviates Sj^fc, is accounted for. More specifically, the general linear model (2.1) 
used to estimate the subject-specific parameters is directly used at the within-subject 
level, rather than the Gaussian approximation (2.5). 

Based on this more realistic model, a Bayesian approach is developed wherein the pos- 
terior distribution of the group mean effect, p(/iA:|yi,fc, •••, yn,k)-, is estimated, either 
using MCMC techniques, or a faster approximation, assimilating it to a noncentral t- 
distribution. 

Such an approach may seem computationally intensive at first, since the complete data of 
all subjects must be analyzed together. However, it is shown in [Woolrich et al., 2004a] 
that the posterior distribution of yU^ depends in fact only on certain summary statistics 
of the individual datasets. These can be computed beforehand, hereby greatly reducing 
the computational complexity of this approach. 
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Some authors have further proposed to relax the Gaussian assumption at the between- 
subject level. In [Roche et al., 2007], the distribution of the Xj^^'s is assumed totally 
unknown, and a nonparametric maximum likelihood estimation (NMLE) method is de- 
veloped. In [Woolrich, 2008], a mixture model consisting of two Gaussian distributions 
is used to describe population heterogeneity, one class corresponding to possible outlier 
subjects. 

2.3.3 Test of a nonzero group effect 

Based on the hierarchical model introduced in the previous section, our final goal is to 
test in each voxel k the presence of a nonzero mean effect, i.e., test 'Hq./c '■ f^k = ^ versus 
^i,fc : Aifc 7^ 0. Besides this two-sided null hypothesis, one may also wish to test the 
presence of a positive effect (corresponding to an activation), that is, test the one-sided 
null hypothesis ^o,fc '■ IJ'k ^ versus "Hi^k '■ fJ-k > 0. This test involves the two following 
steps: 

• Definition of a test statistic T^ = Tk{yi^k, ■ ■ ■ , yn,k) 

• Statistical calibration of T^, i.e. computing the threshold u such that P{Tk > 
^1^0, fc) ^ CK) for any required level a. The level specifies the probability of falsely 
rejecting 'Ho,fc, i-e., the probability of a type I error, or false positive. 

Choice of a test statistic 

In the simple RFX model, i.e., when the first-level standard deviates Si^k are implicitly 
assumed constant, as in [Friston et al., 1995], a natural choice is the standard i-statistic 



Tk- ^' 



std{yk)/Vn- 1' 



where y^ = ^ YA=iyi,k and std{yk)'^ = ^ E"=i(yi,fe - Ykf- This yields the uniformly 
more powerful (UMP) test at any level a, both for the one-sided and the two-sided test 
[Lehmann, 1986]. 
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In the general MFX model, there is no optimal choice of a test statistic in terms of power. 
In [Worsley et al., 2002], a'^ is estimated by restricted maximum likelihood (ReML), us- 
ing an expectation-maximization (EM) algorithm [Dempster et al., 1977]. This is equiv- 
alent to an iterative reweighted least-square procedure to estimate /i^, using the weight 
(s^^ + a^)^^ for observation yi^k- In analogy to the RFX case, the following approximate 
t-statistic is then used: 



OkNn 
A similar approximate i-statistic is used in [Woolrich et al., 2004a], where (nki'^k) ^-re 
estimated in a Bayesian setting by their posterior mean, having marginalized out all other 
hidden variables using a Monte-Carlo Markov-Chain (MCMC) sampling algorithm. 

Another, more systematic, approach, advocated in [Meriaux et al., 2006, Keller and Roche, 2008], 
is to use the maximum likelihood ratio (MLR) for the two-sided test: 

sup £fc(/"fc,o-fc) 
sup Ck{Hk,crl)' 



tj-ki^o, CT^e 



k'^'-'-+ 



where Ck{nk, crl) ^^ ^^^ likelihood of the model (2.7). For the one-sided test, the following 
sign modulation is used: 

ffc = sign{jlk)\/Rk- 

The maximum likelihood estimates {fikT^k) can be computed using an EM algorithm. 
In [Roche et al., 2007], this approach is extended to the nonparametric setting, and it 
is shown that the nonparametric maximum likelihood estimate of the between-subject 
distribution is a combination of at most n Dirac masses. 

Statistical calibration 

Voxelwise statistical calibration is straightforward in the simple RFX model in [Friston et al., 1995], 
using the fact that, under 'Ho,k, Tk follows a Student distribution with n — 1 degrees of 
freedom. The two-sided null hypothesis ?^o,fc : /^fc = is then rejected if |Tfc| > to/2i 
where t^^rt is the (1 — a/2)-th quantile of the t„_i distribution, and the one-sided null 
hypothesis ?^o,fc : /^fe < is rejected if Tk > ta- The Student distribution is also used 
as a substitute for the unknown null distribution of the approximate f-statistics in 
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[Worsley et al., 2002] and [Woolrich et al., 2004a] (see the previous section), using the 
same degrees of freedom as the standard t-statistic. Use of the Student distribution is 
vahd in both cases if the data distribution is Gaussian, or for large sample sizes (n — )• oo), 
owing to the central limit theorem. 

An alternative solution to this calibration problem is to use permutation tests [Good, 2005]. 
They allow exact control on the type I error, under mild assumptions on the sampling 
distribution, and for any choice of a test statistic. This method of calibration was intro- 
duced in the neuroimaging literature by [Holmes et al., 1996]. It consists in sampling 
the null distribution of the test statistic by permuting the data, under certain exchange- 
ability hypotheses. Having sampled A^ values, the threshold of the test is then simply 
equal to the [Aa]-th largest sampled value. 

The principal limitations of permutation tests are the heavy computations they require, 
and also their limited applicability. For instance, the universally exact control on false 
positives does not extend to the test of partial correlations in a multiple regression 
model. Approximate permutation testing procedures have been proposed in this case 
[Anderson and Legendre, 1999, Cade, 2005]. They have been found empirically to pro- 
vide a more precise control over false positives than standard parametric tests, in certain 
cases where the assumptions underlying the latter are not verified. However, no general 
result exists to support these observations. 

2.4 Multiple comparisons 

A multiple comparison problem arises when testing several voxels simultaneously. This 
is the same problem encountered in the analysis of individual subject data (see Sec- 
tion 2.2.4). In its simplest form, it consists in finding a detection threshold u for a given 
statistical map T = (Tfc)i<j!;<rf such that the balance between specificity (control over 
false positives) and sensitivity (control over false negatives) is optimal in a certain sense. 
Appendix A.l gives a precise definition of these concepts. Furthermore, the chosen mul- 
tiple comparison procedure (MCP) must also produce results that are useful in terms of 
neuroscience. Thus the detected activations must be easily linked to known anatomical 
structures. Moreover, a stringent control over false positives is usually required to avoid 
erroneous interpretations. 
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MCP's in neuroiinaging can be divided into two main categories. The first one contains 
voxel- level thresholding procedures, that are presented in Section 2.4.1. However, due 
to the strict control on false positives, and the large number of tests performed (up to 
~ 100 000 voxels in a whole brain image), these procedures typically lack sensitivity. 
Another shortcoming is that they detect individual voxels, which are hard to interpret, 
as they do not constitute relevant biological units. 

Another type of MCP's has been developed to overcome these limitations, which we will 
refer to as cluster-level procedures. First, the statistical map is thresholded as before, 
but this time at an arbitrary level. Next, connected components, or clusters of voxels 
above this threshold are identified. Finally, the presence of activations is tested within 
each cluster, using a secondary threshold on the cluster's size. Cluster-level procedures 
are in general more powerful than their voxel- level counterparts, since they entail much 
less hypotheses to be tested. A review of these approaches is presented in Section 2.4.2, 
and we conclude by discussing the pro's and con's of the different methods. 

2.4.1 Voxel-level inference 

We now address the problem of choosing a detection threshold u for a map of voxelwise 
test statistics T = (Ti, . . . ,Tii), controlling a certain error rate. The problem can also 
be defined as that of finding a threshold c for the map of p- values {pi, . . . ,Pd)- Here T 
may stand either for the activation map of an individual subject (see Section 2.2.4), or a 
group activation map (see Section 2.3.3). In both cases, the null hypotheses considered 
here, noted Tik = in Appendix A.l, are the voxelwise null hypotheses Tio^k- 

FWER-controUing procedures 

A lot of attention has been given to the control of the FWER, that is, the probability 
of one or more false positives, in the neuroimaging literature, as a strict control on false 
positive is necessary in order for the detected activated brain areas to be reliable. We 
review here some of the main approaches. We refer to Appendix A.l for a rigorous 
definition of this and other multiple test error rates. 
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The Bonferroni procedure. The Bonferroni procedure consists in rejecting every 
null hypothesis ?^o,fc whose p-value is smaller than a/d, where a is the desired upper 
bound on the FWER, and d the number of tested hypotheses (one per voxel). Its 
justification relies on no other assumption than subset pivotality, necessary to define the 
p- values (see Appendix A.l). It is a direct application of Boole's inequality, and the fact 
that each pk is uniformly distributed under Tio^k '■ 



FWER = P[V > 0\Hmo] 

= Pi^keMoiPk < a/d}\'HMo] 

< Yl P'^Pk < a/d\no,k] 

keMo 
= do X a/d < a. (2.8) 

In the above derivation, do = tlA^o is the number of true null hypotheses. It can be 
shown that, under independence of the data across voxels, the Bonferroni procedure is 
near optimal [Ge et al., 2003]. This means that there is no procedure significantly more 
powerful having the same level of FWER strong control. 

The maxT Procedure. In the case of mutually dependent tests, the more general 
maxT procedure can be applied. It relies on the fact that, under the global null Hm: the 
probability of one or more false detections can be controlled knowing the null distribution 
of the maximal statistic: 



P[V>0\Hm] = P[3keM,n>u\nM] 
P maxTfc > uIT-Lm 

k&M 



(2.9) 



Thus, to control P[V > 0|^x] at a given level a, the threshold u must be equal to 
the (1 — a)-th quantile of the distribution of the maximal statistic maxfcg_A4 T^. under 
the global null Hm- Note that this method gives weak control on the FWER. Strong 
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Figure 2.9: Excursion sets of 3D random fields, from [Taylor and Worsley, 2007]. 
(a) Ball and torus. The corresponding Euler characteristic is: x = 2 — 1 + = 1. (b) 
Isotropic Gaussian field, with zero mean and variance one, above a threshold t = —2, 
corresponding to x = 6 (unseen hollows contribute +1 each) (c) t = 0, x = —6 (handles 
dominate) (d) t = 2, x = 14 (handles disappear) (e) i = 3, x = 1- 



control follows under subset pivotality [Westfall and Young, 1993]. The justification, 
elementary, is given in Appendix A. 



Computing Tail Probabilities. The main difficulty in applying the maxT princi- 
ple is the computation of the tail probabilities P [max^g^vt ^fc > u\T~{-m] ■ These depend 
strongly on the definition of the test statistics T^, and on the distribution of the un- 
derlying data. Hence, many different approaches have been proposed, that we briefly 
review here. A detailed study can be found in [Nichols and Hayasaka, 2003]. 

Parametric approximations were introduced in [Worsley, 1994]. They essentially equate 
the tail probability P [maxfcg_A/( T^ > ulTijn] with the expectation of the Euler charac- 
teristic X of the excursion set {k £ Ai,Tk > u}. For a 3D excursion set, x is essentially 
the number of connected components, minus the number of 'handles', plus the number 
of 'hollows' (see Figure 2.9). For a high threshold u, the excursion set is expected to 
be either empty, or composed of a single cluster with no holes, hence the tail proba- 
bility is approximately equal to £'[x|A^]- This last expectation can be estimated using 
inequalities from random field theory (RFT), that are based on an estimation of the 
smoothness of the spatial map T. In practice, these approximations yield satisfying re- 
sults for smooth maps, but are overly conservative for non smooth maps, where the 
Bonferroni correction is optimal [Nichols and Hayasaka, 2003]. An approach to bridge 
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this gap is developed in [Worsley, 2005, Taylor et al., 2007, Taylor and Worsley, 2007], 
using improved Bonferroni-type inequalities based on the discrete local maxima (DLM) 
of the statistical map. These lead to bounds on the tail probability evaluated using 
RFT-type approximations, that are shown to be near optimal at all smoothness levels. 

The main drawback of RFT-based approximations is that they rely on heavy parametric 
assumptions, which are hard to verify in practice, such as the stationarity, Gaussianity 
and smoothness of the statistical map. They are also restricted to a certain class of test 
statistics, such as t or i^-statistics. Finally, their domain of validity is hard to determine, 
since they assume a 'high' threshold u, but it is not clear what this means in practice. 

In the context of group data analysis, it is possible to avoid these issues by using a permu- 
tation test to tabulate the null distribution of maxfcg_v( T^ [Holmes et al., 1996]. When 
applicable, this approach has many advantages: It is valid under minimal assumptions 
concerning the distribution of the data; it can be applied to any choice of the test statistic 
Tfc ; finally it is near optimal in terms of statistical power [Nichols and Hayasaka, 2003] . 
Despite of all these virtues, permutation testing has not yet replaced RFT techniques as 
a standard for fMRI group data analysis, though both are available in SPM and FSL, 
the most used software packages to date for the analysis of fMRI data. This may in part 
be explained by the important computation time they require, and also by the limited 
range of questions they allow to answer. Indeed, as mentioned in Section 2.3.3, their 
exists no generally exact permutation test of a partial correlation in a multiple regression 
model. 

FDR-controlling procedures 

In spite of the abundant literature devoted to voxelwise FWER-controlling procedures, 
their application remains limited by their lack of power. This is due both to the strict 
control they impose on false positives, and to the large number (up to a hundred thou- 
sands) of voxels, and therefore of tested hypotheses, present in brain activation maps. 
More recently, there have been some attempts to control the less stringent FDR crite- 
rion instead in order to gain statistical power. This is illustrated in Figure 2.10, where 
different error rate controls are compared for a same activation map. 
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The most famous procedm'e for controlling the FDR is the Benjamini-Hochberg proce- 
dure [Benjamini and Hochberg, 1995]. It consists in sorting the p- values in increasing 

order: pn\ < . . . < P(d)- They are then compared to the linear series I —a I , where 

V" J l<k<d 
a is the desired FDR level. Next the following index is computed: k* = max{l < k < 

k 
d, P(fc) < -;a}- If for all k, p(^f,^ > ^a, then no null hypothesis is rejected. Otherwise, 

all null hypotheses Tio^k for k = 1, . . . ,k* are rejected. [Benjamini and Hochberg, 1995] 

showed that, under independence of the p- values, this procedure had strong control over 

the FDR, i.e., 

FDR = E 






< —-a < a. 
a 



This proof, along with the simplicity of the algorithm, greatly contributed to popularize 
this approach. 

Several works have been concerned with the use of FDR for neuroimaging data, such 
as [Pacifico et al., 2004], which proposes an extension of of the BH approach to random 
fields, based on an estimation of the field's smoothness similar to the one developed 
in [Worsley, 1994]. In [Nichols and Hayasaka, 2003], was expressed the belief that FDR 
would soon replace FWER as a standard type I error rate in fMRI data analysis. How- 
ever, relatively few applications have been published to date based on the FDR, though 
the BH procedure has been included in the reference SPM software. To this we see 
two possible explanations. The first one is that the BH procedure can be unstable with 
respect to the p-values, which is a problem when analyzing fMRI datasets, known to be 
very noisy. Hence, while allowing better statistical power, FDR-controlling procedures 
are also liable to produce more false positives, without being able to discriminate them 
from true positives, as illustrated in Figure 2.10. Another one is that, as mentioned 
earlier, voxels do not constitute relevant units from a cognitive point of view. In a 
broader sense, this may also be the main reason why voxel-level inference methods are 
not mainstream in neuroimaging papers concerned with the applications. 

2.4.2 Cluster- level inference 

We now turn to cluster-level multiple testing, which is the most widely used approach 
to detect activated regions given fMRI activation maps. Using the same notations as 
above, the statistical map (Ti, . . . , T^) is first thresholded at a certain height u. However, 
in this context u does not serve to test the voxelwise hypotheses Tio^k- Rather, it is used 
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Figure 2.10: Voxel-level error rate controls compared on the same activation map. 
Left: FPR controlled at 0.001, using voxelwise tests. Middle: FDR controlled at 0.05 
using the BH procedure. Right: FWER controlled at 0.05 using the Bonferroni proce- 
dure. The FDR control enables to detect substantially more voxels than the FWER 
control at the same level. However, 5% of voxels detect by this approach, i.e. 118 in this 
example, are expected to be false positives. In contrast, less than 60 false positives are 
detected on the average using a simple voxelwise test (left), for a qualitatively similar 

result. 



to identify connected components, or clusters, from the excursion set {k G M,T^. > u}. 
Thus it is referred to as a cluster- defining threshold, and is a fixed, user-specified quantity. 
The choice of u is arbitrary; a popular heuristic is to tune it in order to control the 
voxelwise FPR at a certain level a, that is, allow a proportion a of null voxels to be 
retained at this stage. 

Next, the presence of an activation is tested within all detected clusters Ci, . . . ,Cm- 
More precisely, for each cluster d, the null hypothesis: l-id = flfcec^o.fc is tested, 
against the alternative l-Ld = Ufcec^i.^- ^^^ cluster null hypothesis is equivalent to 
stating that Ci is enclosed in the null set A^o defined in Section A.l: T-ia = {Ci ^ -^o}- 
The alternative hypothesis T-Ld states that at least one voxel in Ci is activated. 

The decision statistic used to test each cluster-level hypothesis is generally the cluster 
size jijCj, though many other choices are available, as discussed in Section 2.6.1. Hence, 
the null hypotheses T-Ld are rejected for all clusters whose sizes exceed a critical value 
N, tuned to control a certain error rate, that is, for clusters that would be 'unusually 
large' in absence of any activation. 

As in voxel- level inference, cluster-level multiple comparison procedures are usually re- 
quired to have strong control over the FWER, which in this case is the probability of 
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detecting one ore more clusters by mistake: 

FWER = PlVyOlnMo] 

= P[3CiCMoJCi>N\nMo]- 

Following the maxT principle (see Section 2.4.1), this is usually done by controlling the 
tail probability of the maximum cluster size, under the global null TIm^ since the true 
subset of null hypotheses is unknown. This means tuning N so that 



P 






< a. (2.10) 



This procedure only provides weak control on the FWER; strong control follows under 
subset pivotality [Holmes et al., 1996]. The proof of this result is given in Appendix A. 

Computing the tail probabilities of null distribution of the the maximum cluster size 
maxc- jjCj is the principal difficulty of this approach. The principal methods for doing 
so are reviewed in [Hayasaka and Nichols, 2003]. As previously, parametric approxima- 
tions based on RFT theory are available [Worsley et al., 2002], as well as exact calibra- 
tion based on permutation tests, if T is a group activation map. These have the same 
advantages and drawbacks as in the voxel-level setting (see Section 2.4.1). To summa- 
rize, though easy and fast to implement, the RFT approach relies on heavy parametric 
assumptions, and may be overly conservative is the statistical map is not smooth; the 
permutation testing approach is exact under very mild nonparametric assumption, and 
is always near optimal [Hayasaka and Nichols, 2003]. Its only drawback is that it is 
computationally intensive. 

Cluster-level inference as described above has several key advantages over voxel-level 
inference. First, the number M of clusters is much smaller than the number d of voxels. 
Thus, testing cluster-level hypotheses greatly reduces the multiple comparison problem; 
this explains why cluster detection is in general more powerful than voxel detection. 
Another advantage is that results are reported in terms of regions rather than voxels, 
and are therefore easier to interpret from a cognitive point of view. The detected clusters 
may be related to known anatomical regions based on expert knowledge, or using a digital 
brain atlas such as the Automated Atlas Label (AAL) [Tzourio-Mazoyer et al., 2002]. 
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Figure 2.11: Clusters detected at different cluster- forming thresholds. Axial slices 
z = 37niin in Talairach, represented with the subjects' mean anatomical image in the 
background). From left to right, the threshold is tuned to control the false positive rate 
(FPR) respectively at 10^^, 10~^ and 10~^ uncorrected. Each cluster surviving the 
FWER controlling-threshold at 5% is represented with a specific color, showing how 
distinct functional regions are merged. In particular, left and right hemispheres are not 

segmented. 



The principal disadvantage of cluster-level inference wish respect to voxel-level inference 
is its dependence on the cluster-defining threshold, which ultimately defines the detected 
regions. The fact that a suprathreshold cluster is found significant by the cluster size 
test only implies that it contains some active voxels [Hayasaka and Nichols, 2003]. Low 
values of the cluster-forming threshold may result in merging functionally distinct re- 
gions, thus yielding poor localization power, while high values may result in missing 
active regions. This is illustrated in Figure 2.11. 

2.5 Limits of the SPM-like approach 

The 'SPM-like' approach has been summarized in Sections 2.2, 2.3 and 2.4. In the 
following, we will assume that activations are detected at the cluster-level, as described 
in Section 2.4.2. Though simple, and widely applicable, this approach suffers from 
several important limitations, which we summarize here: 



Arbitrary cluster-forming threshold. As noted in Section 2.4.2, clusters are de- 
fined for a certain cluster-defining threshold. The choice of this threshold is arbitrary, 
even though it is crucial for the resulting inference. 
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Exclusive control of false positives. Error rates are controlled by maximizing the 
statistical power of the tests while limiting a certain type I error rate at a given level 
a (see Section A.l). Consequently, there is no direct control on the amount of type II 
errors, or false negatives, meaning that the absence of activations outside the detected 
clusters cannot be assessed, and that there is no guarantee that the whole functional 
network can been recovered. This partly explains the poor reproducibility of group 
analyses across datasets [Thirion et al., 2007a]. 

Assumption of perfect match bet^veen individual brains. Due to unavoidable 
inter-subject registration errors (see Section 2.3.1), the observed activations are not 
well-localized, and possibly displaced across distinct functional regions, which may 
result in blurring the group activation map and creating unhandled false positives 
[Keller et al, 2008]. 



2.6 Alternative approaches 

Many approaches have been developed to overcome the limitations identified in the 
previous section, which we review now. Research on this subject has been conducted in 
different directions, which we have chosen to identify as follows: 

To start with, many efforts have been devoted to develop new thresholding techniques 
to address the multiple comparison problem, both at the voxel and cluster level, and 
are discussed in Section 2.6.1. Their goal is to overcome the limitations of the standard 
MCPs, such as the exclusive control over false positives, or the dependence on a cluster- 
forming threshold. 

Next, to define potentially active regions, as a natural alternative to suprathreshold 
clusters one may use pre-defined regions of interest (ROIs), related to the question to 
be answered by the fMRI experiment. For instance, ROIs may be defined as anatomical 
structures that one expects to be functionally involved in a certain target task. This 
strategy avoids the problematic choice of a cluster-forming threshold, and provides a way 
to incorporate prior informations on the regions involved in the task at hand. However, 
the use of such fixed regions poses several challenges, which we describe in Section 2.6.2. 
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An increasingly popular approach, termed surface-based analysis, aims at improving 
the localization of each subject's activity. It consists in projecting for each subject 
the fMRI data on the cortical surface, segmented from the anatomical images. Spatial 
normalisation is also performed on these surfaces rather than on the whole brain volume. 
We describe several procedures for surface-based analysis, and discuss their pro's and 
con's in Section 2.6.3. 

Finally, so-called feature-based approaches represent a very fruitful line of research. They 
consist in extracting from the individual data certain high-level features, such as local 
maxima of an activation map, and comparing them across subjects. By performing the 
group analysis at a higher level than the voxel-level, such approaches naturally reduce the 
multiple comparison problem, and also provide a way to deal with spatial uncertainty. 
Several methods are presented in Section 2.6.4. 

2.6.1 Thresholding techniques 

We now review a few papers which are representative of the directions which have been 
explored to ameliorate standard voxel and cluster-level thresholding techniques. 

Joint control over false positive and false negative risks 

Voxel-level Bayesian inference Voxel-level inference methods have been proposed 

that control both false positive and false negative rates. In [Friston et al., 2002b, Friston et al., 2002a], 

the usual voxelwise test is extended using Bayesian inference. Thus, the voxelwise p- 

values pk are replaced by the posterior probabilities q^ that the mean group effect /x^ is 

larger than a certain baseline b. The method implemented in [Friston et al., 2002b] has 

several drawbacks however. First, the meaning of the baseline b is unclear, and so is its 

choice. Second, the Bayesian inference proposed in [Friston et al., 2002b] is solely aimed 

at voxel detection. In particular, it provides no equivalent to the cluster-level test (see 

Section 2.4.2) which is the current standard in fMRI data analysis. 

Third, the multiple comparison problem is not addressed. This seems due to a confusion 
between Bayesian inference and multiple testing. Indeed, it is stated in the abstract that 
in contrast to 'conventional SPMs' (i.e., the p-value maps {pi, . . . ,pd)), the posterior 
probability maps (PPMs) (qi,. . . ^q^) 'are not confounded by the multiple comparison 
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problem'. However, both SPMs and PPMs address the same problem of selecting active 
voxels, while jointly limiting the number of false positives and false negatives. The fact 
that these are measured by frequentist rates (in the case of SPMs) or Bayesian risks 
(in the case of PPMs) does not change the fact that increasing the number of tested 
hypotheses mechanically increases the risk of detecting activations by chance. Thus, 
some form of correction for multiple comparisons must be applied in both cases. 

These issues may explain why this approach is scarcely used in practice. 

Mixture modeling An alternative proposed in [Beckmann et al., 2003b] is to fit a 
spatial mixture model to the voxel-wise test statistics, where null, activated and deac- 
tivated (inhibited) regions are modeled separately. A Gaussian distribution is used for 
null, or inactivated, voxels, a Gamma distribution for activated voxels, and a negative 
Gamma distribution for deactivated voxels. The same model has also been proposed in 
the context of individual subject fMRI data analysis, to model neural response levels 
(see Section 2.2.3). A threshold can then be derived, which minimizes a certain classi- 
fication risk, such as the binary risk, associated to the 0-1 loss function, resulting in a 
'naive Bayes' classifier. 

This approach is revisited in [Woolrich et al., 2005, Woolrich and Behrens, 2006], where 
the status of each voxel is further modeled according to a Markov random field. This 
allows to account for the spatial structure of fMRI activation patterns, where active 
voxels tend to be grouped in clusters rather than isolated. The same model has also been 
used for the modeling of fMRI time-series, as mentioned in Section 2.2.3. Active voxels 
are detected using Bayesian inference, by computing a map of posterior probabilities 
that each voxel is active, and selecting the most probable state for each voxel, as in 
[Beckmann et al., 2003b]. 

Threshold-free cluster enhancement 

In the context of cluster-level inference, dependence on the cluster-forming threshold is 
addressed in [Smith and Nichols, 2009], where an original approach is developed, termed 
threshold- free cluster enhancement (TFCE). It consists in applying a filter to the sta- 
tistical map, that has the effect of enhancing the amplitude of extended signals, while 
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reducing less extended ones. This reduces the dependence on the cluster-defining thresh- 
old, while at the same time increasing SNR. 

It must be noted though that the enhancement filter is defined by two parameters, noted 
E and H, which need to be tuned by the user, in exchange to the lessened dependence 
on the cluster-forming threshold. The choice of E and H is however extensively studied 
in [Smith and Nichols, 2009], and the robustness of the approach with respect to this 
tuning is demonstrated in practice over a wide range of datasets. 

When used in association with a multiple testing procedure, another potential shortcom- 
ing of TFCE is that the resulting statistical map does not verify subset pivotality (see 
Appendix A.l for a precise definition). In other terms, the value of the enhanced statistic 
in each voxel is not independent from the statistics in other voxels. As a consequence, 
only weak control is guaranteed for the resulting procedure, but not strong control (see 
Appendix A for formal definitions of weak and strong control, and justifications of these 
assertions). This means that the absence of bias due to TFCE, though supported by 
simulations, has yet to be demonstrated mathematically. 

Alternative choices of cluster-level summary statistic 

Finally, the choice of a decision statistic for testing cluster-level hypotheses remains 
a point of debate. The widespread use of the cluster size has often been criticized, 
as it naturally favors spatially extended signals, and may result in missing small ac- 
tivations, irrespective of their intensity. Thus, several approaches have been devel- 
oped to combine cluster statistics. In [Poline et al., 1997] the cluster size and cluster 
peak (maximum statistic value) are combined through multivariate rejection regions. 
In [Hayasaka and Nichols, 2004], a wider panel of possible cluster summary statistics 
are compared on several datasets, using combining functions. Though promising, these 
approaches have not been shown to produce significantly different results from the con- 
ventional cluster size test. The core problem seems to be that each cluster summary 
statistic defines a test that is optimally powerful for a certain type of signal (extended, 
peaked, etc.). Combining different statistics does not produce a more powerful test, but 
rather changes the type of signal that is detected. 
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To date, there is to our knowledge no cluster-level inference approach that controls false 
negative clusters, i.e., controls the risk of missing an activated cluster. 

2.6.2 ROI-Based Analysis 

ROIs in fMRI single-subject data analysis 

The use of regions of Interest (ROIs) to segment the brain volume is common in single- 
subject fMRI data analysis [Poldrack, 2007]. They are often defined as anatomical struc- 
tures, extracted from the Tl-weighted image of the subject under study or as clusters 
detected as active from a previous fMRI experiment. fMRI paradigms specifically de- 
signed to define ROIs are known as localizers. They usually consist of simple stimuli, 
meant to elicit a response from a known functional region, in order to identify its location 
for the subject under study. 

In both cases, ROIs are an attractive alternative to suprathreshold clusters for the 
definition of candidate activated regions; they avoid the troublesome choice of cluster- 
defining threshold, and provide a way to include prior information on the regions involved 
in the investigated task. When defined anatomically, they also provide a statistically 
sound way to assess the link between brain structure and function, which is one of the 
fundamental questions fMRI studies aspire to answer. 

ROIs in fMRI group data analysis 

In spite of these advantages, ROIs are scarcely used as a basis for inference in the 
context of fMRI group data analysis. Rather, clusters detected as active are empirically 
compared to anatomical ROIs, usually in the form of an atlas of brain regions, such as 
the single subject Automated Atlas Labels (AAL) [Tzourio-Mazoyer et al., 2002], in a 
post-treatment step, as illustrated in Figure 2.12. This raises several questions, because 
active voxels cannot be localized within an active cluster (see Section 2.4.2). Thus, 
when a cluster extends over several atlas regions, there is no guarantee that any of 
these regions contain any active voxels. Moreover, there is no single way of performing 
this comparison, so several heuristics are traditionally used, with potentially different 
answers. 
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Figure 2.12: Three procedures to perform the automated anatomical labehng of 
functional activations, proposed in [Tzourio-Mazoyer ct al., 2002] (the outline of the 
AAL parcellation is overlaid on the z=5.0 mm axial slice of the single-subject anatomical 
image used to define it) : (A) Local maxima labeling: the red cross indicates the location 
of the local maximum. (B) Extended local maxima labeling: percentage of overlap 
between a 10-mm sphere radius centered on the local maximum and the different parcels. 
(C) Cluster labeling: percentage of overlap between the activation cluster and the 

different parcels. 



In fact, it seems rather surprising to us that, after all the efforts devoted to develop 
statistically well-defined activation detection procedures, the crucial step wherein these 
activations are identified with anatomical structures is performed using post-hoc meth- 
ods, producing results which cannot be validated or refuted on an objective basis. 

A notable exception is found in [Bowman et al., 2008], which uses AAL to divide the 
brain volume in regions assumed to represent distinct functional units. Based on this 
parcellation, a Bayesian hierarchical approach is developed to identify regions whose 
functional responses are statistically correlated, a phenomenon known as functional con- 
nectivity. Thus, parameters at the regional level are modeled as a Gaussian vector with 
arbitrary covariance structure. By estimating the covariance matrix, functional connec- 
tivity may be assessed between any couple of ROIs, for the specific task investigated. 
Moreover, this framework could also be used to test regionwise hypotheses, as in the 
SPM-like approach, even though this is not done in [Bowman et al., 2008]. 

However, the use of ROIs for fMRI group analysis poses a series of challenges, which 
may explain why they have not yet been adopted as a standard tool. As noted in 
[Poldrack, 2007], a major issue is the important variability of anatomical structures 
accross subjects, only crudely compensated by the normalisation step (see Section 2.3.1). 
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Thus, anatomical ROIs are hard to define at the group level, and individual images are 
never aligned with any given set of ROIs. This may result in false positives, since 
activations are likely to be displaced accross regions. 

This point of view is developed in [Nieto-Castanon et al., 2003]. It is shown that the 
overlap of individual anatomical ROIs across subjects vanishes quickly as the group size 
increases (see 2.13). This observation is used to motivate an alternative analysis of 
multi-subject neuroimaging data, based on individual ROIs rather than an anatomical 
atlas. Briefly summarized, the functional data is analyzed separately for each subject- 
specific ROI. For each ROI, the corresponding summary statistics are compared across 
subjects. Note that this approach does not require inter-subject registration, since each 
subject's functional data is analyzed in reference to its own anatomy. 
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Figure 2.13: Mean overlap of individual anatomical ROIs across different subject 
group sizes in [Nieto-Castanon et al., 2003]. 



However, as discussed in [Poldrack, 2007], measuring the functional signal of a single 
subject within each given ROI is a challenging task. For each subject, there may be 
only a small proportion of voxels activated inside a given region, so that averaging the 
signal, as seems to be done in [Nieto-Castanon et al., 2003], may swamp the signal with 
the noise from the many inactive voxels. In contrast, standard mass-univariate methods 
gain statistical power by averaging the data across subjects in each voxel. Thus, they 



fMRI Group Data Analysis: the 'SPM-like' Approach 40 

may be able to recover activations which would otherwise be lost at the single-subject 
level. 

Another issue in terms of group modeling is that individual anatomical structures may 
not be present in all subjects. This must be taken into account if one wants to perform 
a RFX analysis, i.e. generalize the findings on the cohort of subjects under study to 
their parent population. 

These shortcomings may explain why this methodology has only been demonstrated 
so far in a fixed-effect (FFX) setting. Though [Nieto-Castanon et al., 2003] claim that 
random-effect (RFX) analyses would yield 'similar results', there may be both a sensi- 
tivity and a modeling issue in the generalization of this approach. 

2.6.3 Surface-based analysis 

Motivation 

The SPM-like approach searches for BOLD activations throughout the whole brain vol- 
ume. However, neurons are mainly concentrated near the cortical surface (though they 
are also present in subcortical structures). Moreover, there is evidence supporting the 
assertion that cortical landmarks, such as sulci and gyri, correspond to boundaries be- 
tween functionally distinct regions [Brett et al., 2002]. 

Consequently, volume-based detection may be sub-optimally sensitive because statistical 
power is wasted by searching for activations in the wrong places, such as the white 
matter. Also, volume-based registration may fail to match cortical structures, which 
may in part explain the bad overlap between homologous functional regions. Moreover, 
distinct functional areas which are well separated along the cortical surface may be close 
in terms of the euclidean distance, due to the highly convoluted topology of the cortical 
folds. Hence, They may be hard to separate in a fMRI contrast map, whose lower spatial 
resolution (3mm^ against up to Imm^) cannot account for such fine details. 

Based on these observations, the analysis of functional neuroimaging data based on the 
cortical surface rather than the entire brain volume has received an increasing attention 
over the last decade. The goal of such surface-based techniques is to obtain a more precise 
localization of individual functional areas, and improve their matching across subjects, 
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based on the assumption that functional regions are better identified on the cortical 
surface by the associated anatomical landmarks than in the 3D Talairach coordinate 
system. 

Surface-based registration 



A method for surface-based registration is presented in [Fischl et al., 1999]. For each 
individual, the procedure starts by reconstructing the cortical surface from a structural 
MRI image. This surface is then inflated, and transformed into a sphere using a regis- 
tration algorithm that minimizes metric distortions, as illustrated in Figure 2.14. In this 
fashion, the folding pattern of the subject's cortex is mapped using a polar coordinate 
system, so that to each point corresponds a measure C of curvature, with negative and 
positive values indicating gyral and sulcal regions respectively. This folding pattern is 
then aligned with a canonical, average folding pattern (Figure 2.14, to the right). 




Figure 2.14: Unfolding of the cortical surface onto a sphere (in [Fischl et al., 1999]). 

The advantages of this approach with respect to volume-based registration are demon- 
strated in [Fischl et al., 1999]: The overlap of cortical structures, such as the central 
sulcus, are greatly improved, and sensitivity in the detection of fMRI activation is ob- 
served. However, no clue is provided regarding how the fMRI data is projected on the 
cortical surface, though this step is bound to be a challenge in this type of analysis. 

A similar approach is developed in [Essen, 2005], involving a flattened representation 
of the cortical surface, which is mapped onto a sphere and matched to an atlas of 
the cortical folding pattern. However, a more sophisticated atlas is constructed here, 
obtained by matching anatomical landmarks extracted from each individual's cortex. 
As in [Fischl et al., 1999], this approach significantly improves the alignment of cortical 
structues (see Figure 2.15). The issue of mapping fMRI data to the cortical surface is 
addressed by simply assigning to each node of the mesh representing the cortex the voxel 
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Figure 2.15: Average cortical surfaces of 12 subjects, registered to the atlas in 
[Essen, 2005]. A - C. Lateral views of the average surface, both with and without 
inflation. D - F. Similar to A - C, but showing the medial views. The average surface 
preserves the main features of the sulcal anatomy, showing a clear improvement over 
volume-based registration using affine transformations (see Section 2.3.1). 

containing it, though [Essen, 2005] mentions 7 (at least) other algorithms that could be 
used instead. 



Projection of fMRI data onto the cortical surface 



Mapping the fMRI data onto the cortical surface thus seems to be one of the main dif- 
ficulties of surface-based analysis. Indeed, because of the low resolution of fMRI data, 
it is impossible to assign each voxel to a particular point of the cortical surface. In 
practice, no reference projection method exists, and the choice of a particular strat- 
egy may influence the results. A recent development on this subject can be found in 
[Operto et al., 2008a], wherein fMRI data is projected onto the cortical surface, using 
interpolation kernels informed by the subject's anatomy. This approach is shown to 
be more robust to registration errors than other, less sophisticated approaches. How- 
ever, the authors agree that matching of the fMRI data volume with the cortical ribbon 
remains largely an open issue, due mainly to the low resolution of current functional 
images, and the ensuing partial volume effects. 

In conclusion, surface-based analysis of fMRI data seems a promising alternative to 
volume-based approaches, by providing a better match between the cortical structures 
of different subjects, and consequently increasing the overlap of homologous functional 
regions. However, this increased anatomical precision seems to come at the cost of 
a degraded match between functional and anatomical data. Thus, in both cases the 
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Figure 2.16: Examples of anatomo-functional parcellations for a grasping task in 

[Flandin, 2004]. From left to right: mixture model with 100 Gaussian components, 500 

components, corresponding maximum a posteriori parcellation. 

uncertainty on the localization of individual activations persists, and should be taken 
into account when conducting inference at the group level. 



2.6.4 Feature-based approaches 



As mentioned in Section 2.6.3, the SPM-like approach detects activations throughout 
the whole brain volume, by comparing the individual images in a voxelwise fashion. The 
motivation behind surface-based analysis is that voxels are not relevant entities from a 
physiological point of view. 

This criticism of voxel-based comparisons leads to a broader class of methods, which 
we refer to as feature-based. The principle they rely on consists in extracting high-level 
features from the individual data, and matching them across subjects. By choosing 
relevant features that summarize the information contained in the subject's data, these 
approaches may be expected to provide results that are easier to interpret, and gain 
statistical power by reducing the number of multiple comparison. For instance, the 
ROI-based analysis of group fMRI data developed in [Nieto-Castanon et al., 2003] is 
feature-based, the features being the summary statistics measuring the BOLD response 
for each subject-specific ROI. 

The following examples have been chosen to illustrate the diversity of feature-based 
approaches. These methods are best classified according to the choice of a feature set. 
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Functionally homogeneous parcels 

The idea of parcelling the brain volume into parcels that are homogeneous both anatom- 
ically and functionally is investigated in [Flandin, 2004]. A first approach is developed, 
based on the anatomical data of a single subject, to compute a Voronoi tesselation of 
the search volume into K connected components. These tesselations are then used to 
reduce the dimensionality of the subject's fMRI data, by averaging the time courses (see 
Section 2.2.3) within each parcel. The K average time-courses are then used as input 
for the SPM-like approach, instead of the d original time-courses. 

In particular, the presence of activation is tested at the parcel rather than the voxel 
level, using the methodology presented in Section 2.4.1. As a consequence, the multiple 
comparison problem is reduced, and the greater sensitivity of this approach is demon- 
strated compared to the standard voxelwise detection procedure, even though the data 
is spatially smoothed in the last case, another way of reducing the data's resolution. 

Going one step further, [Flandin, 2004] then generalizes this approach by introducing 
a parcellation into clusters containing voxels with similar BOLD responses, based on 
both the anatomical and functional MRI data of the subject under study. Thus, in 
addition to reducing the data's dimension, the proposed procedure also defines units 
that are relevant in terms of neuroscience, since they may be interpreted as functionally 
homogeneous areas. 

Finally, this approach is extended to the context of fMRI group data analysis in the 
following way. The individual images are first normalized to a given template, to ensure 
proximity between homologous functional areas across subjects (perfect alignement of 
the images is not assumed however). Then, the fMRI time-courses of the different 
subjects are concatenated and analyzed as if they belonged to a single subject, using 
the anatomo-functional parcellation described above. This creates group-level clusters 
regrouping voxels with similar fMRI time-courses. A nice feature of this approach is 
that it implicitly defines clusters at the subject level that are automatically matched, 
clusters from two distinct subjects being homologous if they belong to the same group- 
level clusters. 

This approach provides an elegant way to deal with inter-subject anatomical variabil- 
ity, but has several limitations. Indeed, because it includes no inter-subject model, 
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this approach can only characterize the group of subjects under study, i.e. perform a 
fixed-effect analysis. Additionally, single-subject clusters are not necessarily spatially 
connected, making them more difficult to interpret as functional units. Moreover, there 
is a risk that clusters defined according to some measure of fMRI time-course similarity 
may be influenced by the main effect studied in the experiment, which in general will 
dominate the fMRI signal. Hence, such a clustering may not work well when analyzing 
more subtle effects (such as interactions between different experimental factors). 

These shortcomings are addressed in [Thirion et al., 2006c], which revisits the approach 
in [Flandin, 2004] by adding a group level to the model defining the subject parcels. 
According to this model, subject parcels are no longer forced to be subsets of group- 
level parcels, or cliques. The relaxation of this constraint allows the definition of spatially 
connected parcels, such that all subjects are represented in each clique. Furthermore, 
the data is clustered according to the vector of estimated effects rather than the raw 
fMRI time-courses, which means that it is no more dominated by the main effect. 

As in [Flandin, 2004], presence of activity is tested for each clique C, by averaging for 
each subject i the map of estimated effects over the subject-cluster homologous to C 
and performing a t-test on the resulting values. Such an approach is shown to outper- 
form conventional voxelwise tests. There is some concern however about the control 
this approach offers on false positives. Indeed, because subject clusters and cliques are 
estimated from the whole dataset, they cannot be considered spatially independent. 
Thus, there is no guarantee that the vector of clique-level t-statistics verifies the sub- 
set pivotality, hence that this procedure has strong control over the false positive risk 
considered. 

In a recent development based on similar ideas [Tucholka et al., 2008a], the same type of 
multi-subject anatomo-functional parcellations is derived, combined to a surface-based 
approach (see Section 2.6.3). More precisely, A first segmentation of the cortical surface 
into gyri defines distinct anatomical regions. Each region is then segmented into a 
certain number of cliques, represented for each subject by a parcellation into an equal 
number of clusters. The total number of cliques is optimized using a cross-validation 
criterion, providing an insight into the number of functionally distinct regions involved 
in the fMRI paradigm under study. 
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Spatial mixture modeling 

Following the same idea of grouping voxels witii similar responses that is at the core 
of clustering approaches, several mixture models have been proposed for the statistical 
analysis of fMRI data, with the specific goal of detecting activated regions. This idea 
was first pursued in [Vaever Hartvig, 2000], which considered representing the BOLD 
activation pattern as a superposition of Gaussian shaped 'bumps', in addition to a 
uniform background level. The number of bumps was determined automatically, using a 
stochastic geometry model. More recently, [Penny and Friston, 2003] defined a mixture 
model to represent the map of BOLD effects, each component corresponding to an 
activated cluster, modeled using a similar Gaussian shaped response surface, plus an 
additive noise. This idea was further extended in [Kim and Smyth, 2006], which used 
an infinite Bayesian mixture model, based on a Dirichlet process prior, to deal with the 
problem of selecting the number of components. 

The above works are all concerned with single-subject fMRI data analysis, and use a 
parametric surface response to model activation patterns. In a recent work, [Xu et al., 2009] 
uses a similar Bayesian spatial mixture modeling approach for fMRI group data analysis, 
using different levels of hierarchy to model individual activation centers and population- 
level activation centers. The spatial variability of individual centers around the popula- 
tion centers is explicitly modeled, allowing to account for mis-registrations. An original 
feature of this work is that it adopts a Bayesian model averaging point of view, integrat- 
ing out the number of mixture components, rather then estimating it and characterizing 
each cluster. Model averaging is implemented through a reversible-jump MCMC algo- 
rithm. 

Critical points of the individual activation maps 

While in the previous approaches the individual features are defined, and possibly es- 
timated, as latent variables of a certain model describing the data, it is also possible 
to directly extract them as e.g. salient features of the activation maps of the indi- 
vidual subjects, and build a group model from these. This is exactly what is done in 
[Thirion et al., 2007b], where features are defined as the critical points of the individual 
activation maps, such as local maxima and minima, and the associated level sets. These 
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Figure 2.17: Within labelled sulci framework, the activation landmark of computa- 
tion (blue ball on the picture) in the prefrontal lobe is localized near the intersection 
of the three sulci: superior precentral, marginal precentral and superior frontal (from 

[Tucholkaet al., 2008b]). 



are used to represent individual activation centers, and are modeled using a Bayesian 
spatial mixture model similar to those described above. In particular, a Dirichlet pro- 
cess prior is used to define an infinite mixture model to deal with the estimation of the 
number of classes, as in [Kim and Smyth, 2006]. Each class defines a separate group- 
level activity cluster, and the set of such clusters provides a description of the activation 
pattern at the population level. These group-level clusters are shown to be more re- 
producible across datasets than standard clusters obtained with the standard SPM-like 
analysis. 

Along similar lines, [Tucholka et al., 2008b] propose an original approach to localize 
the local maxima extracted from the individual activation maps of a group of subjects 
with known anatomical landmarks, using triangulation techniques. The local maxima 
are first matched across subjects, and viewed as functional landmarks. Then, for each 
subject, the distance of each landmark to three user-chosen neighbouring sulci (see 
Figure 2.17) is computed, then averaged over subjects. Given a new subject, the position 
of the homologous functional landmark is then predicted by triangulation, based on 
these average distances. This is compared with the position predicted by the average 
coordinates in the standard space, and shown to be much more accurate. This nice 
result shows that brain anatomy and function are linked, and suggests that the standard 
coordinate system may not be optimal to localize functional areas. 
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Figure 2.18: Reproducible activations from a right-motor contrast for 4 subjects, in 

[Operto et al., 2008b]. (left) labeled after maximization (right) thresholded individual 

statistical i-maps (p < 0.05 corrected). 



Scale-space blobs 



We conclude this review of feature-based approaches with the fMRI group data analysis 
method proposed initially in [Coulon et al., 2001] for the analysis of 3D activation maps 
and revisited in [Operto et al., 2008b] in the setting of surface-based analysis. In the 
latter, the individual fMRI data are first projected on the cortical surface, as explained 
and for the reasons exposed in Section 2.6.3. The critical points are then extracted 
from the individual t-maps as in [Thirion et al., 2007b, Tucholka et al., 2008b]. The key 
point of this approach however is the scale-space representation of these features it then 
constructs. This adds to the spatial structure linking the critical points another level of 
hierarchy, obtained by considering different levels of spatial smoothing. This description 
allows to account for patterns that appear at different levels of resolution, based on the 
idea that objects of interest happen at many different scales. 

This complex structure is coded for each subject by a graph, whose nodes are referred to 
as scale space blobs, and represent the features of interest. Group analysis is performed 
by embedding these graphs in an encompassing graph, and matching the blobs across 
subjects. The optimal matching is found by optimizing an objective function which 
accounts for the proximity of each pair of matched blobs, and the likelihood that these 
blobs correspond to an active brain area. This is performed in a Bayesian framework, 
the field of labels coding matches between blobs being modeled as hidden Markov field, 
that is estimated by a posteriori maximization. 
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The final representation of tfie functional reponse of the group of subjects consists in 
a list of labels representing activations that are reproducible across subjects, and for 
each subject a set of scale-space blobs representing instances of these activations, as 
illustrated in Figure 2.18. 

As in [Flandin, 2004] and [Thirion et al., 2006a], this creates a 'structural' description 
of the group of subjects, with informations that may be exploited both at the group and 
at the subject level. Results are however not generalizable to the general population, so 
that this approach can be categorized as performing a fixed effect analysis. 

2.7 Conclusion 

The review we have just presented, though far from exhaustive, suggests the wide variety 
of approaches that have been developed for the statistical analysis of fMRI data during 
the last decades, to address the limitations of the 'SPM-like' approach identified in 
Section 2.5: dependence on an arbitrary cluster-defining threshold, exclusive control 
of false positives, and the assumption of a perfect match between the different brains. 
However, to date these issues have been addresses separately. 

The goal of the present work is to propose a new procedure for fMRI group data analysis, 
addressing them jointly. To position ourselves with respect to the existing methods, we 
start by discussing their similarities and divergences. We have found three key points 
that summarize this comparison: 

• First, all the described approaches agree on one point, namely that group in- 
ference should be performed at a higher level than the single voxel. To quote 
[Operto et al., 2008b]: "The voxels are only the acquisition space, and have never 
had any anatomical meaning other than the simple localization provided by spatial 
normalization." 

Thus, the group activation pattern is always described in terms of higher-level 
features, such as suprathreshold clusters, ROIs, group activation centers, parcels 
or multiscale blobs. 

• Secondly, most of the methods presented here rely on preliminary spatial nor- 
malization to align subjects, using either volume or surface-based registration. 
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A notable exception is [Nieto-Castanon et al., 2003], which rehes solely on the 
anatomical structures extracted in each subject to compare them. It is generally 
accepted that perfect alignment of the anatomies is never attained in practice (see 
Section 2.3.1). In spite of this, to date feature-based approaches have been the 
only ones to account for the resulting spatial variability of the activation patterns, 
through explicit modeling. 

• Finally, apart from the ROI analysis framework presented in Section 2.6.2, none 
of the proposed approaches provide an assessment of the link between detected 
activations and anatomical brain structures, though this question is at the core 
of most fMRI studies. An exception is [Tucholka et al., 2008b], which shows that 
activation foci are better localized with respect to neighbouring sulci than than in 
the standard coordinate system (see Section 2.6.4). 

Based on our review and these final remarks, it seems to us that the ROI framework 
has a certain number of key advantages, in relation to the issues we want to address. 
It is a simple and flexible tool that allows to define regions of potential activity at the 
group level based on expert knowledge rather than an arbitrary threshold. Further- 
more, joint control on false positive and false negative risks is not only possible, but 
conceptually simple in this setting, by adopting a Bayesian framework. Moreover, as 
shown in [Bowman et al., 2008], it also provides a means of investigating the functional 
connectivity of distant brain regions. 

Feature-based approaches also constitute an appealing starting point for our work, as 
they have demonstrated their ability to account for imperfect matches between subjects. 
However, we feel that using a high-level description of the single subjects estimated 
effects maps would not necessarily help us attain our goal. The main reason for that is 
that we are not interested in describing each subject, but rather the functional network 
involved in the task at hand, in a reproducible way across subjects. Furthermore, we 
have no idea how to define these features, since there is an infinity of possible choices, 
that may all be at the expense of discarding valuable information. Thus it appears 
to us that adopting a feature-based approach in our case would only add un-necessary 
complexity. 

Thus, we propose to extend the current state-of-the-art ROI analysis methodology, by 
relaxing the assumption that the individual images are in perfect match on a voxelwise 
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basis. This can be done by explicitly modeling spatial variability, as is common in 
feature-based procedures, but this time at the voxel-level. 
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Chapter 3 

Random thresholds for linear 
model selection, revisited. 
Application to fMRI data 
analysis. 



Abstract 

In [Lavielle and Ludena, 2007] , a random thresholding method is introduced to select the 
significant, or non null, mean terms among a collection of independent random variables, 
and applied to the problem of recovering the significant coefficients in non ordered model 
selection. We introduce a simple modification which removes the dependency of the 
proposed estimator on a window parameter while maintaining its asymptotic properties. 
A simulation study suggests that the modified estimator performs better at low signal to 
noise ratios, where the original one is unstable with respect to the window parameter. An 
application of the method to the problem of activation detection on functional magnetic 
resonance imaging (fMRI) data is discussed. 
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3.1 Introduction 

A popular approach to activation detection in fMRI data analysis consists in thresholding 
a statistical map of brain activity [Friston, 1997]. 

The choice of a detection threshold is usually addressed by controlling at a user-fixed level 
a certain type I error rate, such as the family wise error rate (FWER) [Nichols and Hayasaka, 2003] 
or the false discovery rate (FDR) [Pacifico et al., 2004, Benjamini and Hochberg, 1995]. 
It can be argued however that the choice of a level, which ultimately defines the de- 
tected regions, is arbitrary, as their is no safe guideline to what an 'optimal' level of 
false detections should be. 

We consider here as an alternative the random threshold method recently developed in 
[Lavielle and Ludeha, 2007], which consists in estimating the number of non null mean 
terms among a collection of independent random variables. It is based on a random 
centering of the partial sums of the ordered observations. Consistency of the proposed 
estimator can be shown, using L-statistics techniques. 

Though requiring no prior tuning of a type I error rate, this method still depends on 
a user-chosen window width, and has so far been demonstrated on high signal to noise 
(SNR) simulated data only, whereas fMRI data is known to be very noisy. This paper 
describes a simple modification of the procedure which makes it totally unsupervised; 
the modification simply consists in replacing the fixed window width by a varying one. 

The article is organized as follows: In Section 3.2, we review the random threshold 
method in [Lavielle and Ludeiia, 2007], and introduce the varying window extension. 
In Section 3.3, a simulation study compares the random threshold procedure to the 
standard type I error rate control methods described above. Application to fMRI data 
analysis is discussed in Section 3.4. 



3.2 Method 

Assume we observe y, = fj-i + €i, where variables ej are centered, independent and 
identically distributed with common cumulative distribution function (cdf) F^. 
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3.2.1 Original random threshold procedure 

Testing the presence of significant coefficients 

We start by recalling the procedure for testing if all the /ij's are null or not, given the 
observations y^; 1 < i < n, that is, testing the null hypothesis Hq : /Xj = for i = 
1, . . . , n against the alternative Hi : 3/ C {1, . . . , n} \ \/i € I, fii > 0. The procedure is 
defined as follows : 

i) Order the observations |y(i)| > \y{2)\ ^ • • • ^ |y(n)l- 
ii) Fori = l,...,n, let X(j) = - log(l - F|,|(|y(j)|)). 
iii) Let Tj = J27=i^ii) and Qj = EHo{Tj\Tn). 



iv) Define the test statistic Dn = niaxj \Tj — Qj\/^/n. The null hypothesis is rejected 
if Dn > da, where da ensures that the level of the test is at most a. 



The conditional expectation E,ii^^{Tj\Tn) is easily computed due to the fact that under 
the null hypothesis, (-'l^(i))i<i<n is an ordered sequence of exponential random variables, 
and to the following result, whose proof is omitted here: 

Proposition 3.1. Assume ^(i), • • • ,^(ra) is an ordered sequence of Exp (1) random vari- 
ables, with X(i) > . . . > ^(n)- For any 1 < j < n, let Tj = Y17=i -^(i)- Then, for any 
I < j < K < n: 



Eho(^w) = E7 



n 



EHo(r,) = j + jY.] 









Selecting the significant coefficients 

Upon rejection of the null hypothesis, the following task consists in selecting the signif- 
icant coefficients. The procedure for doing so can be interpreted in a data-dependent 
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'multiple hypothesis testing' setting, as described hereafter. Consider the null hypothesis 
Ho as defined in Section 3.2.1, and the set of alternative hypotheses: 

Hi(k) : for any i < k, ^i^^^ > 0, and fi(^k+i) = ■■■ = /^(n) = 0. 

Denote E^ the expectation under Hi(k) (instead of lEHi(k))- The procedure first com- 
putes the X(j)'s using the same steps i) and ii) as in Section 3.2.1, then adds the following 
steps: 

iii) Let Kn be some positive integer. For 1 < k < n — Kn and 1 < j < Kn^, compute: 

k+i 

Tk,j = Y^ Xf^i) 

i=k+l 
Qk,j = ^k{Tk,j\Tk,Kn) 

rjk = max \Tkj - Qkj\/Vn. 

l<j<K„ 
iv) Let kn = argmini<fc<^^ryfc- 

As in the preceding section, Qkj can easily be computed using Proposition 3.1. rjk 
can also be defined from the centered partial sums (Tfcj- — Qkj) using the ip norm for 
1 < p < oo rather than the £oo norm, by setting rjk = n^P'"^^^ Ylj=i \'^k,j — Qk,j\^- 

Unknown distribution extension 

The above procedure can be extended to the case where the distribution F^ of the e^'s 
is a parametric distribution F^(- ; 0*), but where 9* is unknown. 

For < k < n — 1, let 6k = 9{yk+i, ■ ■ ■ ,yn) be an estimator of 9. Let i^|e|(- ; 9*) be 
the distribution of the |ej|'s. For any G 0, let Xi{9) = —log (l — -F|e|(|y(j) ; ^|)) and 
Tk,j{9) = X]j=fc+i Xu\{9). Then the following procedure is defined : 



i) Let Kn < [(1 — b)n] be some positive integer. For [an] < k < n — Kn 

1. let 9k = 9{yk+i,...,yn), 

2. for i = l,...,n, let X(j)(4) = -log(l - i^|e|(|y(i)| ; Ok 
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3. for 1 < j < Kn, compute 

k+j 

i=k+l 
Qkj{&k) = '^k{Tk,j{dk)\Tk,Kn{^k)) 

Vk{Ok) = max \Tkj{Ok) -Qk,j{Ok)\/Vn. 

ii) Let kn = aTgmm^n<k<bnVk{Ok)- 

In the applications presented in Section 3.3 and Section 3.4, we consider Gaussian noise: 
F^{- ] cj^) = A/'(- ; 0, (T^) and estimate 6* = o"^ by the usual mean squares estimator: 

n 1 v-^n 2 

^f^ ~ v::^ 2^i=k+iy{i)- 

3.2.2 Varying window extension 

The procedures defined in the preceding sections depend on a parameter Kn which can 
be interpreted as a window width, since % is a function of X(^_,_;^), . . . , X^^j^j^p^^y Tuning 
this parameter may be difficult, as seen in Section 3.3. This issue can be avoided by 
re-defining r]k as a function of X(^k+i)j • • • > ^{n)^ thus replacing the fixed width Kn by a 
varying width n — k, which requires no prior tuning. We define the following procedure: 

iii) Let k„ be a lower bound on the number of null coefficients. For 1 < k < n — Kn 
and 1 < j < n — k, compute: 

Qk,j = ^k{Tkj\Tk^n-k) 

rjk = maxi<j<„_fc \Tkj - Qk,j\/Vn - k. 

iv) Let kn = argmini<fc<„_^^r/fc. 

In other terms, r]k would be strictly equal to the test statistic Dn defined in Section 3.2.1, 
if the sequence (X(j))i<j<„ where replaced by the subsequence (X(j))fc+i<j<„, i.e., the 
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null set of variables under Hi(k). As in the previous section, ip norms can be used 
instead of the ^oo norm by setting rjk = {n — k)^^'"^^^ Z]?=i \^k,j — Qkj\^- 

Notice that A;„ is independent of k„, as long as r/ reaches its global minimum on 
{1, . . . , n— K„}. It is also immediate that the varying window extension, which we present 
here for the procedure in Section 3.2.1, can be applied to the unknown distribution ex- 
tension in Section 3.2.1. 

Asymptotic properties 

The estimator of the number of significant coefficients presented in Section 3.2.1 is 
consistent. This is the main result in [Lavielle and Ludeha, 2007], and it can be extended 
to the varying window setting. We start by recalling the following asymptotic framework: 

AFl There exists t* G (0, 1) and a subset I^* of {1, . . . , n}, with A;* = [t*ri] and jl^* | = 
fc*, such that /Uj / if i G 1^* . For all other index, /ij = 0. 

AF2 For any i G J^*, \^i\ > On, for a certain sequence (a„) with a„ — ;■ oo (see 
[Lavielle and Ludeha, 2007] for further details). 

AF3 Kn/n -^ c such that < c < 1 -t*. 

We then have the following result: 

Theorem 3.2. Let kn stand for the estimator defined in Section 3.2.2. Under assump- 
tions AFl, AF2, AF3, and appropriate Von-Mises type conditions on the cdf F^ of the 

errors (see [Lavielle and Ludeiia, 2007] for further details), kn is consistent in the sense 
that: 



> n„ ^ 0, (3.2) 



kn 

n 
for any positive decreasing sequence (un) such that y/nun — )■ oo. 

This result can be refined by deriving an upper bound, which we do not detail here, 
on the convergence rate of the probability in Equation (3.2), for a particular choice of 
sequence (un). The proof of this Theorem is given in Appendix B. 
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Consistency also holds in the unknown distribution case (see Section 3.2.1). Consider 
the following assumptions on the cdf Fi^i of the |ej|'s: 

Fl. Fi^i is two times differentiable as a function of 6 with a.e. strictly positive derivative 
at 9 = e\ 

F2. 0* belongs to some compact set and there exists under H^* a consistent estimator 
^fe* =0{yk^+i,...,yn) of 61*. 

F3. There exists (a, b) such that 0<a<t*<6<l and a Lipschitz continuous 
function 9 defined on [a,b] such that, under H^*, (^[tn]) converges in probability 
uniformly on [a,b] to 9{t). 

Then we have the following result: 

Theorem 3.3. Let kn stand for the estimator defined in Section 3.2.2. Assume Fl, F2 
and F3. Let {un) he any positive decreasing sequence such that \/nUn -^ oo. Under the 
asymptotic framework defined by AFl, AF2, AF3, 



Hi(fc*) 



tCr, 



n 



> u, 



0. 



The proof of this result in the varying window setting follows the same lines as that of 
Theorem 3.2, and is not detailed here. 
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Figure 3.1: The sequence % for < /u^ < 4 

Dashed line: fixed window width Kn = 5 000; Solid line: varying window width. 



Random thresholds for linear model selection, revisited. 60 

Table 3.1: Number of misclassified voxels (false positives and false nega- 
tives) for different thresholding methods. 





Kn = 15 000 


Kn = 35 000 


var. window 


GGM 


< /ii < 4 


7477 ± 193 


7148 ± 228 


7120 ± 237 


11016 ± 666 


1 < /li < 5 


5079 ± 243 


4736 ± 250 


4706 ± 255 


8306 ± 1862 


2<fii<6 


2561 ± 185 


2385 ± 182 


2381 ± 184 


1901 ± 125 


3 < /ii < 7 


930 ± 118 


892 ± 120 


891 ± 120 


766 ± 102 


4 < /^i < 8 


297 ± 55 


298 ± 68 


297 ± 67 


262 ± 52 



Results are obtained over 100 replications, and given in the form: mean it std. deviate. 

3.3 Simulations 

3.3.1 Experiment summary 

In this experiment, we have repeatedly simulated n = 50 000 Gaussian random variables 
with /Xj distributed uniformly on [a,b] for 1 < i < 10 000, and /U^ = for 10 001 < i < 
50 000. (ej) is a sample from the Af{0, 1) distribution. 

(a, 6) was chosen successively equal to (0,4), (1,5), (2,6), (3,7), (4,8). 100 datasets 
where simulated for each of these values. On each simulated dataset, we then used the 
procedure described in Section 3.2.1 (the unknown distribution extension), assuming 
a Gaussian cdf for Fi^i with unknown variance, and compared the results for different 
window widths Kn = 15 000, 35 000. We also used the varying window extension, as 
described in Section 3.2.2, applied to the unknown variance setting, and the Gamma- 
Gaussian mixture (GGM) modeling procedure in [Beckmann et al., 2003b]. 

Our choice of simulation parameters was guided by the fact that 50 000 is roughly the 
number of voxels in a whole-brain fMRI activation map. 8 can be seen as an upper bound 
on the maximal signal intensity, since in the real data we analyzed (see Section 3.4), the 
maximum z-score was equal to 6.82 for the individual subject activation map, and 5.08 
for the between-subject activation map. 

3.3.2 Results and discussion 

We compared the different approaches through the error rates they achieved on the 
different datasets. We considered the binary risk, i.e., the number of misclassified voxels. 
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Figure 3.2: Different thresholding strategies for < ^^ < 4. 

Top: Gamma-Gaussian mixture fit (solid curve: data, dashes: Gaussian, dash-dots: 
Gamma). Bottom: Random thresholds with Kn = 5 000 (dashed line) and varying Kn 
(solid line). Data corresponds to the solid curve, null data to the dashes and non null 
data to dash-dots. 

Table 3.2: False positive rates for different thresholding methods. 





Kn = 15 000 


K„ = 35 000 


var. window 


GGM 


< /ii < 4 


0.001 ± 0.001 


0.002 ± 0.001 


0.002 ± 0.001 


0.219 ± 0.019 


1 < /li < 5 


0.001 ± 0.001 


0.002 ± 0.001 


0.002 ± 0.001 


0.182 ± 0.062 


2<fii<6 


0.002 ± 0.001 


0.003 ± 0.001 


0.003 ± 0.001 


0.009 ± 0.003 


3 < /ii < 7 


0.002 ± 0.001 


0.002 ± 0.001 


0.002 ± 0.001 


0.004 ± 0.001 


4 < /ii < 8 


0.001 ± 0.001 


0.001 ± 0.001 


0.001 ± 0.001 


0.002 ± 0.001 



Results are obtained over 100 replications, and given in the form: mean it std. deviate. 

as a measure of how well the null and non null sets where separated (see Table 3.1). We 
also computed the false positive rate (FPR) (see Table 3.2). 



As could be expected, differences between methods were most observed when the data 
was simulated with a low signal-to- noise ratio (SNR), (first and second line of each 
table). In this case, the Gamma-Gaussian mixture (GGM) model is clearly misspecified 
since it cannot account for negative observations in the non- null set, as illustrated in 
Figure 3.2. This leads to risk and false positive rate (FPR) values much higher than 
those obtained by the other thresholding methods. 
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For higher SNRs, the different approaches gave similar results, even though the GGM 
model yielded slightly lower risks than the different random thresholds, at the cost of 
higher FPR values. 

On the whole, the different random thresholding procedures gave very similar results in 
all cases, and yielded mean FPR values that were systematically below 0.003, and also 
much lower than those found by GGM fit. 

In conclusion, this experiment suggests that the random thresholding approach provides 
a very good control on false positives, and yields reasonable risk values at all SNRs, with 
a slight advantage for the varying window extension. In contrast, the GGM fit appears 
to yield more false positives, and performs poorly at low SNRs. 

3.4 fMRI data 

We used an event-related fMRI protocol involving a cohort of 37 right-handed subjects. 
The participants were presented with a series of stimuli or were engaged in tasks such as 
passive viewing of horizontal or vertical checkerboards, left or right click after audio or 
video instruction, computation (subtraction) after video or audio instruction, sentence 
listening and reading. Events occurred randomly in time (mean inter stimulus interval: 
3s), with ten occurrences per event type, and ten event types in total. 

The subjects gave informed consent and the protocol was approved by the local ethics 
committee. Functional images were acquired on a General Electric Signa 1.5T scanner 
using an Echo Planar Imaging sequence (time of repetition = 2400 ms, time to echo = 
60 ms, matrix size = 64 x 64, field of view = 24 cm^). Each volume consisted of 34 
64 X 64 3 mm-thick axial contiguous slices. A session comprised 130 scans. Anatomical 
Tl weighted images were acquired on the same scanner, with a spatial resolution of 
1 X 1 X 1.2 mm^. Finally, the cognitive performance of the subjects was checked using 
a battery of syntactic and computation tasks. 

First-level analyses were conducted using SPM5 http://www.fil.ion.ucl.ac.uk. Data were 
submitted successively to motion correction, slice timing and normalization to the MNI 
template. For each subject, BOLD contrast images were obtained from a fixed-effect 
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Figure 3.3: Axial slice from a z-score map for the " sentence checkerboard" 

contrast. 

From left to right: Unthresholded, thresholded by GGM model fit, varying-window and 

fixed-window (Kn = 15 000) random thresholding. Detected activations are 

superimposed on the subject's anatomical image. 

analysis on all sessions. Group analyses were restricted to the intersection of all subjects' 
whole-brain masks, comprising 43 367 voxels. 

We considered the t-score maps computed for different contrasts of experimental condi- 
tions. These were first converted to z-score maps, to obtain approximatively Gaussian 
statistics in inactivated voxels. Using these maps as input data, we then compared, as 
in the simulation study, the detection thresholds obtained by Gamma-Gaussian mixture 
modeling (GGM), fixed- window random thresholding and the varying-window exten- 
sion using in the last two cases the unknown variance extension of the method (see 
Section 3.2.1). For simplicity, we only present here the results obtained for a fixed 
window equal to Kn = 15 000. 

3.4.1 Individual subject activation map 



Our first illustration concerns the activation map of a single subject, for the "sentence- 
checkerboard" contrast. This contrast subtracts the effect of viewing horizontal and 
vertical checkerboards from that of reading video instructions, thus allowing to detect 
brain regions specifically implicated in the reading task. 

Figure 3.3, left, shows an axial slice from the z-score map before thresholding. Acti- 
vations are clearly seen in Wernicke's and Broca's areas (right and upper right), which 
are known to be involved in language processing (see [Price, 2000], for instance). The 
detection threshold found by GGM fit for the z-score map (2.03) is much lower than 
those found by the random threshold procedure, both with a varying window (3.19) and 
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a fixed window (3.33). We note tliat these thresholds follow the same order found in the 
simulation study. 

The random thresholds with fixed and variable windows yield very similar activation 
maps in this case, which seem to capture the activated regions seen in the raw map. In 
contrast, the much lower threshold found by mixture modeling detects several smaller 
clusters, some of which may be false positives. 

3.4.2 Group activation map 

In this second example, we consider a group activation map, specifically a map of t- 
statistics computed from the individual contrast maps of 15 subjects, thus enabling to 
infer regions of positive mean effects in the parent population. Our choice of limiting the 
number of subjects, rather than using the whole cohort, was driven by the fact that many 
fMRI studies are conducted on groups of less then 20 subjects. The remaining subjects 
were used to assess the variability of the thresholds found by the different approaches 
with respect to the choice of the subgroup, as described in Section 3.4.3. 

We report results from the "calculation-sentences" contrast, which subtracts activations 
due to reading or hearing instructions from the overall activations detected during the 
mental calculation tasks. This contrast may thus reveal regions that are specifically 
involved in the processing of numbers. 

Figure 3.3, left, shows an axial slice from the activation map before thresholding, with 
clear activations in the bilateral anterior cingulate (upper middle), bilateral parietal 
(lower left and right) and right precentral (upper right) regions, all known to be involved 
in number processing, as explained in [Pinel et al., 2007]. 

Though sorted in the same order as previously, the varying window random threshold 
(2.49) is now roughly at equal distances from the threshold found by GGM modeling 
(1.79) and the fixed window random threshold (3.06). 

The three methods detected activations in the regions described above, though the fixed 
window random threshold seemed to miss some activations, and the GGM approach 
further detected smaller clusters, some of which may be false positives. 
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Figure 3.4: Axial slice from the group activation z-score map for the 
"calculation sentence" contrast. 

From left to right: Unthresholded, thresholded by GGM model fit, varying-window and 

fixed-window (Kn = 15 000) random tliresholding. Detected activations are 

superimposed on tlie mean anatomical image of all subjects. 

3.4.3 Reproducibility study 



In both experiments described above, it seemed that the varying window threshold found 
a good compromise between the GGM fit, which selected possibly inactive voxels, and 
the fixed window random threshold, which seemed to be overly conservative. However, 
we cannot conclude from these observations alone that the varying window threshold is 
'better' than the other thresholds. 

To compare them on a more objective basis, we studied the variability of the different 
thresholds with respect to the choice of a subgroup. More precisely, we repeatedly and 
randomly selected a group of 15 distinct subjects from the cohort of 38, computed the 
corresponding group activation map for the computation task, and thresholded it by 
all three methods. We then computed the empirical mean and variance of each sample 
of threshold values. These provided unbiased estimates for each threshold's mean and 
variance, with respect to the data's sampling distribution: 

Proposition 3.4. Let Yi = {ya, . . . , ym) be the activation map for subject i G {1, . . . , N}, 
let J C {1,...,A^} be any subgroup of subjects, and T = T{Y,J) = T({li}jgj) any 
statistic computed from the activation maps indexed by J. Define: 



7ren„ 



7ren„ 
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Table 3.3: Mean and variance estimates of the thresholds found by different 
methods for the computation task. 



Method GGM fix. window var. window 



Mean 1.92 2.96 2.54 

Variance 0.76 0.12 0.08 



where Hn is the set of all permutations on {1, . . . , N}. If the Yi 's, for i G {1, . . . , N}, 
are independent random variables following a common distribution V on M", then E(T) 
and V(T) are unbiased estimates of the expectation and variance of T, respectively. 

Proof. This is an immediate consequence of the first two lemmas from Section 6 in 
[Strasser and Weber, 1999]. 

Note that we used Monte-Carlo estimates of the quantities defined in Proposition 3.4 
by using 100 random permutations, since performing exhaustive permutations was in- 
tractable. 

As can be seen in Table 3.3, The GGM yields the most variable threshold, and the 
varying window is slightly less variable than the fixed window random threshold. Thus 
the varying window variant yields the most stable threshold among those tested here. 

3.5 Discussion 

By introducing a simple modification to the random procedure proposed in [Lavielle and Ludeiia, 2007], 
we have obtained an entirely unsupervised procedure for recovering non null mean terms 
from a collection of independent random observations, based solely on a parametric 
model of the null terms. Importantly, our modification, which requires no prior tuning, 
conserves the consistency properties of the original procedure. 

Simulation results suggest that the random threshold approach has very good properties 
in terms of type I error rate control. While the fixed window seems overly conservative, 
and unstable with respect to the window parameter at low signal to noise ratios, the 
varying window extension, which avoids instability, also seems to find a better compro- 
mise between specificity and sensitivity. 
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The first results on real fMRI data are encouraging and seem to confirm the good 
stability achieved by the varying window random threshold, with respect to the other 
tested strategies. 

A more thorough investigation is needed to confirm these preliminary results. Repro- 
ducibility of the regions found by random thresholding, rather than just the threshold 
values, can be assessed by resampling techniques similar to those used here, as described 
in [Thirion et al., 2007a], and compared to mixture modeling, as well as to false positive 
control strategies. Finally, studying the asymptotic properties of the random threshold 
estimator when null mean and non-null mean terms are not well separated is also very 
important in view of the applications. 



Chapter 4 



Modeling spatial uncertainty 



Abstract 

This chapter presents an extension of the mass univariate detection approach for group 
fMRI data, which relaxes the assumption of perfect match between the effect maps of the 
different subjects. A set of hidden variables is introduced in the classical mass univariate 
model, which represent registration errors, and are modeled as random deformation 
fields. The group mean effect map is estimated in a Bayesian setting by its posterior 
expectation. This is evaluated numerically, using a Metropolis- within Gibbs algorithm to 
sample from the posterior density of all hidden variables. We also show the consistency 
of the posterior density of the model parameters. 

Using simulations, we evidence a stretching effect of the estimated activation pattern 
when the registration errors are unaccounted for, causing neighboring activations to 
be merged. This stretching effect is substantially reduced when registration errors are 
modeled. When applied to real fMRI data, our method yields group effect maps under 
spatial uncertainty that are both smoother and more contrasted than under no spatial 
uncertainty, an effect that cannot be reproduced by linear isotropic smoothing. These 
results are obtained in spite of the slow mixing of our posterior sampling algorithm, 
suggesting some space for improvement. 
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4.1 Introduction 

In this chapter, we relax the assumption of perfect match between individual brains, 
which is one of the limitations of the SPM-like approach identified in Section 2.6. To 
this end, we incorporate in the mass univariate model specified by (2.5) and (2.6) a 
set of hidden variables, representing spatial normalisation errors, which are modeled 
as multivariate random fields. A first implementation of this idea has been published 
in [Keller et al., 2008]. 

In the following, the individual effect maps are seen as warped and noisy versions of the 
group activation pattern to be estimated. Our goal here is not to solve the registration 
problem but rather to develop an inference strategy that accounts for registration errors. 
When performing registration, the spatial transformations are the parameters of interest. 
In our case, these transformations are viewed as nuisance parameters to be integrated 
out during the analysis. Also, they are modeled as residual deformation fields, which 
persist after the actual registration has taken place. This precludes the use of global 
linear transformations, such as translations or rotations, to model the deformations. 
Rather, these are seen as local, zero-mean displacements, with no pre-determined form. 

There exists an extensive literature on image registration. General reviews on this 
subject can be found in [Brown, 1992, Zitova and Flusser, 2003]. More specific reviews 
include [Van den Elsen et al., 1993, Maintz and Viergever, 1998], on the registration of 
medical images, and [Toga, 1999], on registration in brain imaging. Because we wish to 
model registration errors, we are interested in a statistical formulation of the registration 
problem. In [McGillem and Svedlow, 1976], a first step in this direction is taken, where 
the two images to be aligned are considered as identical up to a deformation and an 
additive Gaussian noise, justifying the use of the variance as a measure of overlay quality. 
More generally, in addition to such a dissimilarity measure, the criterion to be optimized 
by the registration procedure may contain a so-called regularization or penalty term, 
which imposes constraints on the transformation [Hajnal, 2001]. This penalty term 
may take many different forms. In the framework of pattern theory [Grenander, 1993], 
deformations are modeled in a Bayesian setting, in which case the penalty is interpreted 
as a prior density. 
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Furthermore, the problem we address here is closely related to template estimation, also 
referred to as atlas construction in the medical image analysis literature [Subsol, 1995, 
Joshi et al., 2004]. A simple atlas construction method consists in iterating a 'registra- 
tion step' and an 'averaging step': on each iteration, individual images are registered to 
a current version of the template, and then averaged in order to update the template. 
For instance, the widely used MNI anatomical template was constructed in this fashion 
(see Section 2.3.1). However, there is no guaranty that this scheme yields a statisti- 
cally consistent estimate of the mean image. More recently, [AUassonniere et al., 2007, 
Allassonniere et al., 2008, Lepore et al., 2008, Sabuncu et al., 2008] have proposed to 
construct a Bayesian estimate of the template, given prior distributions on both the 
deformation fields and the template parameters. Consistency of the template's MAP 
estimator is shown in [Allassonniere et al., 2007]. 

This chapter is organized as follows: in Section 4.2, we introduce spatial uncertainty in 
the mass univariate model specified by (2.5) and (2.6), under the form of unknown spatial 
deformation fields. The deformation fields are modeled in Section 4.3. From this model, 
we derive in Section 4.4 a Bayesian estimate of the template //, given a certain prior dis- 
tribution on the model parameters, in an approach similar to [Allassonniere et al., 2007]. 
Section 4.5 illustrates on several simulation studies how techniques that ignore spatial 
variability may result in blurring and stretching activation patterns, and how this effect 
can be reduced by our approach. Finally, results on real fMRI data are presented in 
Section 4.6. 



4.2 Observation model 

Following the notations introduced in Section 2.3.2, Xj = (xi^i, . . . ,Xi^d) is the map of 
BOLD effects of subject i = 1, . . . , n, in response to a certain contrast of experimental 
conditions; y, = (yi,i, . . . ,2/j,d) is the noisy estimate of Xj, available from the analysis 
of the subject's scans (see Section 2.2), and s? = {sfi,..., sjj) an image of estimation 
variances. Recall that, under a sufficient number of acquired scans, the estimation error 
Ei can be assumed Gaussian, so that: 
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Yi = Xi + Bf, ei~AA(0,diag(sf)), (4.1) 

where diag(s?) denotes the diagonal matrix whose diagonal is given hj {sf^, . . . , sj^). As 
in [Keller et al., 2008], we extend the mass- univariate Gaussian between-subject model 
described in Section 2.3.2 by incorporating spatial uncertainty so that at voxel k : 



Xi(vfc) = /x(vfc + Ui,fc) + ei,fc; ^,^Af{0,a^Id). (4.2) 

Here, we note Xj(vfc) = Xi^k to emphasize that it is a spatial map; /i G M'^ is the map 
of mean population effects; the vector Uj^^ G M'^ is a hidden variable that models the 
subject-to-atlas registration error for subject i at voxel k; and the (,i^k, 1 < ^ < ^, model 
the between-subject variability of the effect at voxel k. Finally, we define /^(v^-l-Uj^) by 
discrete interpolation, as being equal to the mean population effect in the nearest voxel 
Vfc' = argmiuv; ||v; — (v^ -|- Uj^fc)||. In practice, this is done by rounding each coordinate 
of Uj fc toward the nearest integer. We will also note in the following c^j the function that 
maps each voxel index k to the corresponding displaced voxel index k' for subject i, so 
that /X(VA: + Ui^k) = IJ-v.ik)- 

(4.2) generalizes the classical mass univariate model (2.6), which corresponds to the spe- 
cial case where the registration errors Uj jj are neglected. Note however that the between- 
subject variance o"^ is uniform over the search volume in our formulation, whereas it is 
voxel dependent in the classical setting. Indeed, we found from practical experience that 
a voxel dependent variance raised overfitting issues when modeling registration errors, 
and resulted in degenerate estimates in certain voxels. We used a uniform variance 
as the simplest form of constraint to avoid this overfitting, at the price of a possible 
over-simplification. In Section 5.2, we introduce a regionalized version of Equation (2.6) 
where a^ is allowed to vary across functionally distinct regions. More sophisticated 
methods could be considered, for instance by modeling the variance map o"^ as a smooth 
random field, and constitute a possible direction for future work. 
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4.3 Deformation field model 

As a standard way of representing nonlinear local deformations [Zitova and Flusser, 2003], 
we use Gaussian splines to model the displacements Uj^^. Specifically, elementary dis- 
placements Wj {, are defined in a limited number of fixed control points {v^^^, b = 1, . . . , B}. 
These are then interpolated to the whole brain image using a radial basis function /C, 



B 



Uj,fe = ^/C(vfc,VfeJwj_b, (4.3) 



b=l 



where /C(vfc, Vk^) = exp — {||vfc — Vfc^|| /2a; }, and a; is a user-chosen parameter control- 
ling the displacement field's smoothness. 

The Wjb's are modeled as a i.i.d. Gaussian variables, with spherical covariance matrix 
cr|l3, where as models the standard registration error: 



7r(w,|a|) oc J]AA(w,,fe;0,cT|l3). (4.4) 

b=l 

4.4 Posterior mean estimate 

We estimate the population effect map /x by its posterior mean: 

HfJ-\y] = / M7r(/x|y)(i/2, 

where 7r(/i|y) oc 7r(/x)/(y|/^) is the posterior density of// relative to a given prior density 
Tr{fi). To define this prior, fj, is modeled according to: 

lJ'k = V + Xk; Xfc ~ A/'(0,i^^), 

where rj represents the mean activation across voxels, and u"^ the variance of the activa- 
tion pattern. This hierarchical prior can be seen as an instance of the regional model 
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developed in Chapter 5, in the special case of a single functionally homogeneous re- 
gion. We then define a prior density on these hyperparameters, as well as on the spatial 
uncertainty parameter o"|, as explained in Section 5.4. 

The integral in the above display cannot be computed analytically, so we resort to 
MCMC strategies instead to produce a sample fJ-i, ■ ■ ■ , Hq from the posterior density 
7r(/Lt|y), yielding the following Monte-Carlo estimate: 



9=1 



G 
It 



Sampling the posterior density in the model with spatial uncertainty raises some techni- 
cal issues. The simplest strategy would be to use a Gibbs sampler [Geman and Geman, 1984] , 
to generate a sequence of samples from the joint posterior density of all hidden variables 
X, w, /x, {riju), cr^, and (t|. This is done by partitioning the variables into blocks, then 
sampling successively each block conditionally on all others. However, it turns out that 
the conditional distribution of each elementary displacement Wj^ has no closed-form, and 
cannot be directly sampled. Therefore, we use the more general Metropolis-Hastings 
(MH) algorithm [Hastings, 1970]. This involves the choice of a proposal density to gen- 
erate candidate values Wj^, which are then accepted with a certain rate. Thus, we use 
a Metropolis-within- Gibbs algorithm [Tierney, 1994] that generalizes the standard Gibbs 
sampler by the incorporation of MH iterations. Technical details on this algorithm can 
be found in appendix D, for the more general model introduced in Chapter 5. 

As is often the case with Bayesian estimation, the posterior distribution 7r(/x|y) has very 
good frequentist properties. In particular, it is consistent, i.e., it concentrates on the 
true value //g of the mean population effect map, assuming the data is indeed distributed 
under our generative model /(y|/x). Thus /i, as well as any other Bayes estimator based 
on this posterior distribution, are reasonable choices to estimate /Xq. 

Theorem 4.1 (Consistency of the Posterior Distribution). Let 6 = {fj,,a'^,a'^) denote 
the parameter vector of the model defined by (4-1), (4-^)) (4-3), and (4-4)! o.f^d y'"^ = 
y = (yj)i<j<n the data vector, where n is the number of observed activation maps. Then 
for all e > O.■ 
P[||0-6>o||>e|y(")]''-^O, 

for any value of the true parameter 6q, except possibly on a set of tt -measure 0. 
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Proof. This derives from Doob's theorem (see [van der Vaart, 2000], Chapter 10, for 
instance). Doob's theorem states that in an identifiable parametric model and for any 
prior distribution vr, the sequence of posterior densities is consistent for vr-almost every 
parameter value. The identifiability of our model is demonstrated in Appendix C. 

However, Doob's theorem applies only to independent, identically distributed (iid) vari- 
ables. In contrast, the observations y^ in (4.1) are defined conditional on the first-level 
variances s^, which are different for each subject, and so are not identically distributed. 
This can be arranged through the device introduced in [Meriaux et al., 2006], where the 
s? are considered as part of the data vector, and modeled as a sample from an unspecified 
density /(s?), independent from all other variables. This has no effect on the posterior 
density, and under this assumption the observations (yj,s^)i<j<„ are iid, so that the 
result holds D 

4.5 Simulations 

4.5.1 ID simulations 

We now illustrate our model of spatial uncertainty on a simplistic one-dimensional (ID) 
example. Our purpose here is to give some insight on how deformations are modeled, 
and how they can impact the estimation of the activation pattern fi, which in this case 
is simply a real-valued function. 

Description of the dataset 

The voxels, aligned and equally spaced in this example, are represented by the integers 
k = 1, . . . , 50. We simulated a deformation field according to model (4.3), with a standard 
displacement of as = 3 voxels, and a kernel width of oj = 6.5 voxels. We also empirically 
set the distance between adjacent control points to 2 x lj = 13.0 voxels, with no control 
points at at less then 2.5 x uj = 16.25 voxels from the voxel set boundaries. In the 
present case, this resulted in two control points only, as illustrated in Figure 4.1. 

This displacement field was used to warp a signal defined as the sum of three Gaussian 
'bumps', representing the mean effect map /x, following (2.6), (see Figure 4.2, middle). 
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Figure 4.1: Illustration of the deformation field model. Elementary displacements 
(arrows) in pre-specified control points are interpolated to all other points using a radial 
kernel (dashed lines). The displacement field (solid line) is the sum of the resulting 

weighted kernels. 



Heteroscedastic noise with variance cr^ + s^^ was then added to this warped signal to 
produce a synthetic observation y^ (see Figure 4.2), according to (4.1), (4.2). We chose 
a = 1.0, and the individual standard deviates Si^k where generated as independent 
standard normal variables, multiplied by a noise level of e = 4.0. 

Methods compared 

We generated n = 40 observations as described above, and tried to recover the original 
signal n from these warped, noisy versions. To do this, we computed the posterior 
mean E[/^|y], as explained in Section 4.4. We compared the estimate obtained in the 
full model with that obtained in the model without spatial uncertainty, setting (t| = 0. 
In both cases, the posterior mean was averaged over 100 Gibbs iterations, following 100 
'burn-in' iterations which were discarded. 



It can be seen in Figure 4.3 that, when sampling the posterior distribution of the full 
model, the Markov chain quickly reached a stable state, after approximately 100 iter- 
ations. However, the resulting posterior mean estimate of the spatial incertitude pa- 
rameter, as = 1.6, was lower than the true value, as = 3.0. Conversely, the posterior 
mean of the between-subject standard deviate, a = 2.0 was higher than the actual value, 
a = 1.0. 
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Figure 4.2: Illustration of the model with spatial uncertainty on ID data. 




Deformation field (corresponding to Ui^k in (2.6)). For instance, [ui^is] = —1, [11^,24] 
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Effect of the deformation field on the signal fi (solid line). For instance, the warped signal at 
point k = 24 is equal to the true signal at point 24 + [1*2,24] = 24 + 3 = 27. 
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Data Yj (solid line) , obtained by adding Gaussian noise to the warped signal. 



A possible explanation is that the chain was trapped next to a mode of the posterior 
distribution, away from the global maximum. Escaping from this mode would theoret- 
ically happen after a sufficient number of iterations, but there is no upper bound on 
the CPU time this would require. This illustrates the difficulty of sampling from the 
joint distribution of the deformation fields, even in this simplistic ID example with two 
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Figure 4.3: Posterior sampling of different parameters in the ID model with and 

without spatial uncertainty 
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control points per field. 

Alternatively, this bias in the variance estimates may also be a consequence of the so- 
called shrinkage (or regularizing) effect, i.e. a compromise between estimator bias and 
variance induced by the prior distribution. Specifically, the model on spatial displace- 
ments w (4.4), together with the prior on the spatial standard error as (5.12), favor 
small displacements, thus preventing unreasonable distortions of the individual images 
which would artificially increase the likelihood by overfitting the data. 

Supporting this explanation, the convergence plot a (Figure 4.3, upper right, solid line), 
shows that its sampled values rapidly decrease at first, while the spatial incertitude 
parameters augments (Figure 4.3, upper left). This suggests that the Markov Chain 
is able to find a significantly better match between the individual images by spatially 
deforming them, a being a measure of their global dissimilarity (in contrast, a values 
sampled in the model without spatial uncertainty (Figure 4.3, upper right, dashed line) 
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Estimation of a spatially deformed signal 
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Figure 4.4: Posterior estimates of /x in the ID-model, with and without modeUng 

spatial uncertainty. 



do not vary significantly from their initial value). Then, both a and its' settle in a stable 
state, presumably because an equilibrium has been met between a good match across 
the images (low a) and small, regular deformations (low as)- 

Additional indicators of the enhanced match between individual images, achieved when 
modeling spatial uncertainty, is provided by the convergence plot of the group effect 
map's mean r/ (Figure 4.3, bottom left, solid line) which is less variable than in the 
model without spatial uncertainty (dashed line). Also, the resulting map /i is more 
contrasted, as can be seen from the higher sampled values of its variance v (Figure 4.3, 
bottom right). 

Finally, we can see from Figure 4.4 that the posterior mean estimate \x under the full 
model successfully recovered the three different modes present in the original signal. In 
contrast, as could be expected beforehand, the estimate which neglected the signal's 
spatial variability behaved poorly. As a result, regions of positive signal were swelled, 
and the two nearby peaks merged in a single mass. 

Sensitivity study 



Next, we investigated the reproducibility of the above results, and their sensitivity to 
various features of the simulated data. We measured the quality of each estimate /i by 



its mean-square error MSE{fi) = ^ X^felA^ 
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Figure 4.5: Sensitivity of the posterior mean estimate to noise and deformation field 

smoothness on ID data. 



Influence of noise and deformation field smoothness. In a first experiment, we 
simulated the data for different noise levels (e = 4.0, 8.0, 12.0, 16.0) and different values 
of the deformation field smoothness parameter (uj = 5.0, 6.5, 8.0). For each of the 4x3 
possible combinations of these quantities, we generated 100 datasets, which we used to 
estimate /x, using the two methods described above. In particular, the deformation field 
model used for this posterior estimation assumed a field smoothness of coest = 6.5, so 
that it alternatively under or over estimated the true field smoothness. This allowed to 
investigate the behavior of our sampling scheme when using a mis-specified deformation 
field model, which is inevitable when analysing real fMRI data. 

Results of this experiment are presented in Figure 4.5. It can be seen that both posterior 
estimates are very sensitive to noise, since the MSE increases with the noise level e in all 
cases. Also, the difference in MSE values between the estimates with and without spatial 
uncertainty is clearly seen at the lowest noise level considered e = 4.0. However, it tends 
to disappear at the highest noise level e = 16.0. Deformation field smoothness, measured 
here by the parameter uj, also seems to be a critical parameter, with estimations that 
are better for smoother fields. 



Influence of Sample Size. In a second experiment, we studied how the performance 
of the posterior mean estimate varied when applied to datasets, simulated as described in 
Section 4.5.1, of increasing sizes n = 20.0, 30.0, 40.0, 50.0. As the sample size increases, 
the posterior distribution of the full model concentrates around the true value of the 
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noise= 4.0; 5imu_sigma= 6.5 




Figure 4.6: Sensitivity of the posterior mean to sample size on ID data. 

parameters, according to Theorem 4.1, unlike the posterior distribution of the model 
without spatial uncertainty, whose asymptotic behavior is unknown. Hence, we expect 
the difference in MSE between the two corresponding estimates of // to increase with 
sample size. This is precisely what we observe in Figure 4.6. 

4.5.2 3D simulations 

In this second numerical experiment, we apply our method to 3D simulated data, and 
show that the observations on simulated ID data are confirmed, specifically, that stan- 
dard voxelwise techniques (not accounting for spatial uncertainty) may lead to over- 
estimating the size of positive effected regions. We also show that this undesirable 
"stretching effect" may be reduced by modeling spatial uncertainty. 

Data simulation 



A synthetic dataset was generated as follows. We defined a volume of 24 x 32 x 32 
voxels, containing two spherical activated regions, with uniform intensity value 5 (the 
background was set to 0) and a fixed diameter of 7 voxels. 

This idealized activation, interpreted as the mean population effect map /x, was slightly 
smoothed (using a Gaussian kernel with standard deviation 0.5 voxels) to emulate the 
partial volume effect present in real fMRI data. It was then deformed according to 
a displacement field u, simulated under the model described in Section 4.3, with one 
control point in each voxel. The standard displacement was taken equal to as = 2.0 
voxels and the field smoothness parameter was set to uj = 4.0 voxels. 
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Figure 4.7: Examples of simulated data, (top) Deformed signals, (bottom) Data 

(deformed, noisy signals). 



As for ID simulations, independent heteroscedastic Gaussian noise was then added to 
each voxel v, the variance of which was taken equal to 1 + s^(v), with (s/e)^(v) ~ X^(l); 
e being the noise level, set to 1.0 in this example. A total of n = 40 pairs (yj,s^) of 
effect and variance maps were sampled in this fashion, as illustrated in Figure 4.7, and 
constitute a sample from the hierarchical model in Section 4.2 . 

Methods compared 



Based on the synthetic observations y^^, . . . , y„, and the hierarchical model in Section 4.2, 
we estimated the original signal fx by its posterior mean E[/x|y], as described in Sec- 
tion 4.4. Importantly, the estimation model used to compute fi differed from the one 
used to simulate the data, in that the deformation fields were defined by two control 
points, instead of one control point in each voxel. This modification was introduced 
to verify that the signal could be correctly estimated, even though the model was mis- 
specified, as is unavoidable in real-world datasets. Finally, // was estimated in the 
model without spatial uncertainty, setting (t| = 0, u = 0, allowing to verify the pres- 
ence of a stretching effect due to unaccounted spatial uncertainty, as was observed on 
the ID example (Section 4.5.1). In order to compare objectively the different meth- 
ods, we again measured the performance of each estimate fi by its mean-square error: 
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Figure 4.8: Posterior sampling of different parameters in the 3D model with and 

without spatial uncertainty 
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As in the ID case, we can see in Figure 4.8 that the Markov chain quickly settles in a 
stable state. A similar srinkage effet is suggested from the estimated spatial standard 
error as = 1-2 lower than the true value, as = 2.0. The estimated between subject 
standard error a = 1.1, is however closer now to its true value a = 1.0. This may be 
because the 3D data used in this simulations include many more voxels (18432) than 
the length of the previous ID data (50 points), and thus allow a much better estimation 
of certain model parameters (though as remains conservatively biased) . 

As previously, the posterior estimate of the group effect is significantly more accurate 
using the full model than its version without spatial uncertainty, as can be seen in 
Figure 4.9. As expected, the latter (right) is visibly more spread out spatially than 
the original signal (left), and the two activated regions are merged into a single blurry 
region. In contrast, the signal estimated in the model with spatial uncertainty (center), 
correctly recovers the two separate activation spheres, which also appear less blurred. 
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These impressions are confirmed by the fact that its MSE (0.11) is lower than the one 
obtained without spatial uncertainty (0.19). ct = 2.1 
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Figure 4.9: Posterior estimates of fi in the 3D-model, with and without spatial 

uncertainty. 



Sensitivity study 

Next, we investigated the reproducibility of the above results, and the sensitivity of the 
posterior estimates with respect to various simulation parameters. 



Influence of Noise and Deformation Field Regularity. We studied the influence 
of both the regularity of the deformations and the amount of observation noise. Thus, 
the field regularity parameter used to simulate the data, ujsim, was chosen in [3.0, 4.0, 5.0]. 
Also, we generated the observation variance Si^k for subject i at voxel k according to 
— '— ~ X (1)) with e, controlling the noise level, varying in [1.0,2.0,3.0,4.0]. 

For all 3 X 4 combinations of these parameters, 100 synthetic datasets where generated 
and used to estimate /x, both with and without spatial uncertainty, as described above. 
In all cases, the estimation model used two control points to define each deformation 
field, and a field regularity parameter ujest fixed to 4.0. It was therefore mis-specified 
with respect to the model used to simulate the data, for which the deformation fields 
where defined with one control point in each voxel, and varying values of the regularity 
parameter uJsim- 

The results of the sensitivity analysis are presented in Figure 4.10. As in the ID case, 
both approaches are sensitive to observation noise, and the MSE is seen to decrease 
when the deformation field smoothness uj increases. 
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n= 40.0; simu_sigma= 3.0 
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Figure 4.10: Sensitivity of posterior estimation to noise and deformation field 
smoothness. Mean-Square Error (MSE) for the estimation of signal (/x), in the model 
with spatial uncertainty (solid line) and without (dashed line). The lines represent the 
median MSE across 10 replications, ploted against the noise level e. Dispersion of MSE 
values is represented by the boxplots. Each sub-figure corresponds to a different value 
of the deformation field smoothness parameter uJsimi which increases from left to right. 

noise= 1.0; 5imj_sigma= 4.0 




Figure 4.11: Sensitivity of posterior estimation to sample size. 

Mean-Square Error (MSE) for the estimation of signal (/x), in the model with spatial uncertainty 
(solid line) and without (dashed line). The lines represent the median MSE across 100 replica- 
tions, ploted against the number of observed images n. Dispersion of MSE values is represented 
by the boxplots. 

Influence of sample size. Finally, we studied the behavior of our approach when the 
sample size varied, choosing n = 20, 30, 40, 50. (see Figure 4.11). As in the ID case, the 
drop in MSE due to spatial uncertainty modeling is seen to increase with the data size. 



4.5.3 Conclusion 



The simulations presented in the above sections 4.5.1 and 4.5.2 show how the modeling 
of unknown deformations can lead to significant improvements in the estimation of a 
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spatially warped signal. 

The sensitivity to noise of our estimation method is also highlighted in these illustra- 
tions. This sensitivity may be over-estimated here due to the slow mixing of our sampling 
algorithm, which in particular under-estimated the spatial uncertainty parameter. Nev- 
ertheless, this suggests that modeling spatial uncertainty is most useful in cases where 
the data is not too noisy, and motivates the use of a moderate preliminary smoothing step 
when processing the individual fMRI datasets (see Section 2.2.1). We would however 
recommend nonlinear smoothing strategies, such as cortical surface-constrained filtering 
[Andrade et al., 2001] or anisotropic diffusion [Kim et al., 2005], in order to limit the 
blurring effect inherent to Gaussian and other linear filters. 

Finally, our simulations suggest that the field regularity parameter Wgst of the estimation 
model should ideally be not larger than the actual deformation field smoothness to 
optimally account for spatial uncertainty. It should also not be too small to avoid 
over-fitting and computational issues (since more control points would be necessary to 
define less regular fields). However, tuning ujest remains an issue in real-life applications, 
since we see no way of estimating the actual regularity of the deformations, and their 
distribution may be very far from that specified in Section 4.3. 

4.6 fMRI data 

We conclude this chapter on spatial incertitude modeling by an illustration on real fMRI 
data. We used the Localizer dataset [Pinel et al., 2007], which is described in Section 3.4, 
restricted to a group of 38 right-handed subjects. Moderate preliminary smoothing by 
a 5mm Gaussian kernel was first used, in order to increase the SNR, as justified in 
the previous section. Then, each subject's data was pre-processed and analyzed using 
SPM5, as explained in Section 2.2. 

For each contrast of interest c considered, the subject's BOLD effect maps y, and es- 
timation variance maps s? where calculated, following the notations in Section 2.3.2. 
These were used as input data to estimate the mean effect map /x, as described in Sec- 
tion 4.4. As in the simulation studies, we estimated /x both in the model with spatial 
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uncertainty, and in the model without spatial uncertainty, setting as = 0. The deforma- 
tion field smoothness was set to cj = 4.5 voxels, and each deformation field was defined 
by 60 equally spaced control points. 

100 iterations of the Gibbs sampler were used to average the posterior mean, following 
100 steps which were discarded, corresponding to a burn-in period. This took approxi- 
mately 1 hour and 40 minutes, on a PC with 2.8 GHz clock rate. 

4.6.1 Number processing task 




Figure 4.12: Posterior sampling of as on real fMRI data. 



Behavior of the Markov chain is illustrated in Figure 4.12, by the sampled values of the 
standard spatial displacement 175. At first glance, it seems to settle almost instantly in 
a stable state, presumably next to a local mode of the posterior distribution. 

However, the average value of as increases slightly across the iterations, suggesting that 
the chain is in fact progressing extremely slowly across the parameter space, and is still 
far from its stationary distribution. This implies that the Monte-Carlo approximation 
to the posterior mean, as = 0.61 voxels, corresponding to 4.3mm, is probably lower than 
the true posterior mean. 

Even so, the effect of spatial modeling on the posterior mean estimation is clearly visible 
in Figure 4.13. The posterior mean effect map under spatial uncertainty (middle) is 
clearly smoother than the map obtained under no spatial uncertainty (left). It is also 
slightly more contrasted, with values ranging from —14.9 to 15.7 against —13.8 to 12.7 
when not accounting for spatial uncertainty. 
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Figure 4.13: Posterior estimates of fi for a number processing task. 
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This regularization effect can be explained by the fact that the mean effect map is 
averaged across a large number of displacements, sampled according to their posterior 
likelihood, resulting in a smoothing effect. Thus, it is natural to wonder whether spatial 
modeling does not act merely as a spatial filter, comparable to the linear and isotropic 
smoothing customarily applied to fMRI data. 

We investigated this question by smoothing the posterior mean effect map obtained with- 
out spatial uncertainty, using a Gaussian filter with FWHM equal to 4, 3mm, according 
to the standard spatial displacement estimated in the model with spatial uncertainty. As 
can be seen in Figure 4.13, right, the result is very different from the spatially uncertain 
posterior mean effect map, in that the image is smoother, but with less contrast, the 
values ranging from —12.3 to 11.6. 

Thus, the regularization induced by our spatial modeling approach is seen to be highly 
anisotropic, as it enhances the salient features of the effect map, while reducing the 
background noise. This effect is reminiscent of other Bayesian hierarchical model- 
ing approaches to fMRI data analysis discussed in Sections 2.2.3 and 2.6.1, such as 
[Gossl et al., 2001, Woolrich et al., 2005, Woolrich and Behrens, 2006, Smith and Fahrmeir, 2007, 
Penny et al., 2007, Makni et al., 2008]. In all of these, a similar anisotropic smoothing 
effect is obtained by explicitly modeling correlations between neighboring voxels through 
a hidden discrete- valued Markov field. This Markov field describes the state of each voxel, 
i.e., whether it is active, inactive, or de-activated. Yet another approach can be found 
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in [Harrison et al., 2008], which uses a non-stationary spatial regularisation prior on the 
effect map, which promotes regularization while preserving activation contours. 

However, it is worth noting that all the above methods are concerned with single-subject 
analysis only, and promote regularization by imposing constraints on the subject's ac- 
tivation map. This is very different from our spatial uncertainty model (4.2), which is 
defined at the between-subject level. Indeed, the regularization of the group effet map /i 
is essentially a by-product of integrating out unknown spatial deformations u applied 
to the individual activation maps. The posterior distributions of these deformations are 
more sharply peaked in regions were the individual images are easily matched (such as at 
the border of an activation region) , than in regions where the matching is unclear (such 
as the background). This explains how the anisotropic smoothing effect is obtained. 
Their is little chance that the exact same effect could be achieved by only modeling the 
group effect map /x, according to one of the approaches cited above, which would ignore 
correlations between individual images. 

4.6.2 Language processing task 




Figure 4.14: Posterior sampling of as on real fMRI data. 



We obtained similar results when analyzing data from a language processing task. The 
estimate of the standard spatial displacement as = 0.59, corresponding to a FWHM of 
4mm, was only slightly smaller, and the mixing of the Markov chain is also seen to be 
very slow from Figure 4.14. 

The gain in contrast and regularizing effect due to spatial modeling is also apparent from 
Figure 4.15, and we noticed the same marginal increase in contrast, with extremal values 
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of —33.4 and 20.0 against —31.6 and 19.0 This combined effect could not be reproduced 
by a simple isotropic smoothing with the same FWHM value of 4mm. 

Figure 4.15: Posterior estimates of /j, for a language processing task. 
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4.7 Conclusion 



We have devised an approach for Bayesian inference on a spatially distorted signal, 
which can be applied to fMRI data. On simulated data, this approach estimated the 
original signal from noisy, warped observations with significantly more accuracy than 
when ignoring spatial uncertainty. The latter resulted in a blurred estimate, and the 
merging of neighboring signal peaks. The benefits of spatial modeling were also assessed 
on a large number of simulated datasets, with different values for the noise level, sample 
size and deformation field smoothness 

Applied to real fMRI data, our method yielded smoother and more contrasted posterior 
mean effect maps when modeling spatial uncertainty. This effect could not be repro- 
duced by naive linear isotropic smoothing. These results are very encouraging, and 
illustrate the benefits of accounting for the spatial uncertainty present in neuroimag- 
ing data through proper statistical modeling rather than isotropic smoothing, as is the 
common heuristic. The main difficulty we have encountered is the slow mixing rate of 
the Markov chain under spatial uncertainty, resulting in a conservative estimate of spa- 
tial uncertainty. This leaves space for further improvement, through the exploitation of 
more sophisticated sampling techniques. Thus, the proposal density for the Metropolis 
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Hastings algorithm used to sample the elementary displacements is clearly sub-optimal; 
testing alternative proposals appears a promising line of work. Another possible strat- 
egy would be to design a tempering scheme to 'flatten' the landscape of the posterior 
distribution [Marin and Robert, 2007], which would allow the Markov chain to move 
more freely around the parameter space. 



Chapter 5 

A Bayesian model selection 
approach to the detection of 
functional networks 



Abstract 

In this chapter, we introduce a new paradigm for ROI-based fMRI group data analysis 
that overcomes certain hmitations of the SPM-Hke approach. Using a Bayesian model 
selection framework, the functional network associated with a certain cognitive task is 
selected according to the posterior probabilities of mean regional activations, given a pre- 
defined parcellation of the brain. Thus our approach is threshold-free, while allowing to 
incorporate prior information, provided that the parcellation is sensible. Furthermore, 
by controlling a Bayesian risk, our approach balances false positive and false negative 
risks. Finally, it is based on the same spatial uncertainty model as in Chapter 4, and 
thus accounts for the mis-alignment of individual images, due to inevitable registration 
errors. As a consequence, the assignment of each voxel to a region is random rather 
than deterministic, and differs from one subject to another. 

Results on simulated data show that badly localized effect can cause inactive regions 
to be detected by mistake. This bias toward false positives is reduced when modeling 
spatial uncertainties. However, the posterior probabilities estimates in the model with 
spatial uncertainty are numerically unstable, presumably because of the slow mixing of 
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the Metropolis-Hastings algorithm used to simulate the displacement fields. We therefore 
propose a more stable approximate procedure, which consists in fixing the displacement 
fields to their most probable value a posteriori. This does not entirely reduce the bias 
toward false positives, which is further compensated through an additional penalty on 
model fit. The final procedure is validated on a simulated dataset. 

5.1 Introduction 

In the previous chapter, we have dealt with the assumption of perfect match between 
individual brains. This assumption was relaxed, by incorporating into the mass uni- 
variate model specified by (2.5) and (2.6) (see Section 2.3.2) a set of hidden variables, 
representing spatial normalisation errors. These are modeled as multivariate random 
fields, as described in Section 4.3. 

We now define a more general hierarchical model for the data, based on this model- 
ing of spatial uncertainty, and on a pre-defined parcellation of the brain volume into 
regions that are assumed functionally homogeneous, as described in Section 5.2. This 
model is conveniently represented by its directed acyclic graph (DAG) [Jordan, 2003], 
as illustrated in Figure 5.1, showing its conditional dependence structure. 

Based on this model, we define each region as involved in the task under study if it 
contains a nonzero mean effect, and inactive otherwise. Thus the unknown functional 
network we wish to recover is defined as a partition of regions into involved and inactive. 
Each partition defines a different generative model for the data. Therefore, selecting the 
right functional network can be seen as a model selection problem. In Section 5.3, we 
propose to do so in a Bayesian framework, rating each candidate network in terms of 
its posterior probability, given a prior distribution over all model parameters, defined in 
Section 5.4. 

Because regions are defined beforehand, this method is threshold-free, while taking ad- 
vantage of the prior knowledge about functionally homogeneous regions, provided that 
the parcellation is sensible. By controlling a Bayesian risk, our approach balances false 
positives and false negatives, with relative weights that may be tuned depending on the 
application. 
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Figure 5.1: Directed acyclic graph (DAG) of the full hierarchical model. Gray 
circles correspond to fixed quantities (hyperparameters and observations), white circles 
to random quantities (latent variables). The conditional distribution of the subject- 
specific hidden effects x^ and their estimation y^, are described in Section 5.2.2, while 
the elementary displacements w^ are modeled in Section 4.3; the distribution of the 
population mean effect /.t, conditional on the regional parameters (r/,1/^), is defined in 
Section 5.2. Finally, the prior distribution on the model parameters (rj, u^, (T'^,ag) and 
the indicator variable 7, is specified in Section 5.4. 



The idea of using pre-defined regions of interest (ROIs) for fMRI group data analysis 
has already been exploited in [Bowman et al., 2008], but under the implicit assumption 
that individual images are perfectly registered. As shown in Section 4.5, activations 
estimated by mass univariate detection methods that neglect the spatial uncertainty 
caused by inevitable registration errors are swelled. This 'stretching' effect may in turn 
cause inactive regions to be detected by error, a fact illustrated in Section 5.7. 



One of the main contributions of our work is to relax this assumption of perfect reg- 
istration, by combining the spatial uncertainty model presented in Chapter 4 with the 
regional response model in Section 5.2. Thus for each subject, the membership of a 
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voxel to a given region is probabilistic rather than deterministic. Because the member- 
ship probability accounts for the subject's own functional data, this effectively allows 
to de- weight the contribution of individual activations to regions they are likely not to 
belong to, but rather have been projected onto due to spatial normalization errors. In 
Section 5.7, we show that accounting for this uncertainty indeed makes it possible to 
reduce the risk of falsely selecting a region as active. In contrast, using a probabilistic 
atlas, such as found in FSL, would not provide such subject-specific information, but 
rather inform on the uncertainty in the definition of the region itself. 

The main technical difficulty of our approach is the computation of the posterior prob- 
ability of each functional network in the model under spatial uncertainty, because it 
involves evaluating complex integrals on high dimensional spaces. Section 5.5 describes 
how these integrals can be evaluated numerically, using Monte-Carlo Markov Chain tech- 
niques. However, these techniques turn out to be too computer intensive and numerically 
unstable to be used in practice. Instead, in Section 5.8 we introduce an approximation 
based on posterior modes, which requires much less computer time, and is more stable 
numerically. However, this approximation is less efficient in compensating the stretch- 
ing effect due to displaced activations, so that the bias toward false positives, though 
reduced, is still present. In Section 5.8.3, we compensate for this residual bias by in- 
corporating an additional penalty on model fit, calibrated on simulated data. The final 
procedure is validated on a simulated dataset in Section 5.8.4, and we conclude by a 
discussion in Section 5.9. 



5.2 Regional response model 

Our approach to functional network selection is based on an a priori partition of the 
search volume V into N disjoint regions of interest V = Vi U . . . U Vat, assumed to be 
homogeneous functional areas. More precisely, we re-define the population mean effects 
fik as spatially independent Gaussian random variables, identically distributed within 
each region. 

Thus, for all region j = 1, . . . ,N, and for all voxel k such that v^ G Vj : 

l^k = Vj + Xk] Xfc *~' AA(0,f|), (5.1) 
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In other terms, //fe, previously defined as a voxelwise fixed effect (see Section 2.3.2), is 
now expressed as the sum of a regional fixed effect rjj, representing the average BOLD 
response in region j, and a voxelwise random effect Xk representing the variability of the 
response across voxels. 

The same idea of modeling fMRI data based on fixed parcels containing voxels with 
similar BOLD responses can be found in [Bowman et al., 2008] (see Section 2.6.2). Fur- 
thermore, in this work the regional means are modeled jointly as a Gaussian vector, 
with arbitrary covariance matrix, allowing to measure functional correlations between 
distant regions, as well as between voxels from a same region. Though modeling these 
correlations is not our primary goal, and has not been considered here for simplicity, it 
does constitute a promising direction in which our approach could be extended. 

5.2.1 Generative model without spatial uncertainty 

We start by assuming that the estimated effect maps (yj)i<i<n are perfectly aligned. 
Under the assumptions of the regional model (5.1), the within-subject (2.5) and between- 
subject (2.6) models defined in Section 2.3.2, we specify a generative model for the the 
data, with the further assumption that the between-subject variance o"^ in (2.6) is now 
uniform within each region j. As discussed in Section 4.2, this modification will be 
necessary when modeling spatial uncertainty in the next section, to avoid overfitting the 
data. 

Thus, for all subject i = 1, . . . , n, all region j = 1, . . . , A^ and all voxel k such that 

yi,k = Xi^k+£i,k; Ei^k'"^' ^f{0,slf,) (5.2) 

Xi,k = fj'k + Ci,k; ^i,fc *~' AA(o,(t|), 

(5.3) 

where we assume the mutual independence of the noise processes ^i^k and £i^k- 
Using the expression of fik in (5-1), we can re- write the above model as: 



yi,k = Vj + Xk + Ci,k + £i,k- 



(5.4) 
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This formulation corresponds to a mixed-effect analysis of variance (MFX-ANOVA) 
model with a single factor, whose A^ levels correspond to the different regions. In fact, 
since the noise processes are assumed independent across voxels, it is simply the aggre- 
gation of N independent models, one for each region j. 

This generalizes the mass univariate model, defined in Section 2.3.2, which corresponds 
to the limiting case N = d, that is, when regions reduce to single voxels, under the 
identifiability constraint: i/? = 0. In this case, the regional mean rjj coincides with the 
population mean effect /x^, and the regional between-subject variance cr| becomes voxel 
dependent. 

5.2.2 Generative model with spatial uncertainty 

We now relax the assumption that the individual images are perfectly normalized, so 
that the between-subject model (2.6) is replaced by its generalization to spatial uncer- 
tainty (4.2), introduced in Chapter 4. Spatial normalization errors are defined for each 
subject i as a spatial deformation field Uj, controlled by a set of hidden variables Wj, as 
explained in Section 4.3. 

As in the previous section, we combine this observation model with the regional model 
(5.1), and assume the between-subject variance o"^ to be region-dependent. It turns out 
that, conditionally on the hidden displacements variables Wj, the data is still distributed 
according to a MFX-ANOVA model, independently across regions. However, (5.2) must 
be adapted to account for observations being displaced across regions. 

Consequently, for all subject i = 1, . . . ,n, all region j = 1, . . . ,N,we have for all voxels k 
such that Vfc + Uj^^ G Vj : 

yi,k = Xi^k + £i,k; ei,fc *~' A/'(0,s^_fc) (5.5) 

Xi,k = M(vfc + Uj,fc) +,^j_fc; ^j,fc|wj '^■7\A(0,ct|). 

Again, using (5.1), this can be re-written as: 

yi,k\^i = r]j + X{^k + Ui^k) +Ci,k+£i,k- 

(5.6) 
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Thus, the main difference with the case of no spatial uncertainty is that the mean of any 
given observation yi^k now depends on the region j toward which the voxel k is displaced, 
conditional on Wj, rather than the region it belongs to. The same observation holds for 
the conditional variance cr? of the hidden effects Xj fc. Finally, because the set of voxels 
displaced to region j fluctuates with w, the data is no longer independent across regions, 
after integrating out w. 

This model generalizes all the ones introduced in the previous sections and chapters of 
this work. Indeed, the regional model without spatial uncertainty is the special case 
as = 0, while the mass univariate model in Section 2.3.2, is obtained for as = and 
N = d. Finally, the model introduced in Chapter 4 corresponds to the case of a single 
region (N = 1). 



5.3 Bayesian model selection framework 

Based on the regional response model, we now propose an answer to the basic question 
underlying most fMRI paradigms, namely, that of selecting the subset of regions that 
are involved in the task at hand. More precisely, we define region j as: 

• active if r]j > 0; 

• negatively active if rjj < 0; 

• inactive if tjj = 0, 

and define a region j as involved if it is either active or negatively active. From a 
hypothesis testing perspective, we wish to test, for each j = 1,. . . ,N, T-Iqj : ''rjj = 0' 
versus liij : '%• 7^ 0'. 

Alternatively, recovering the subset of regions that are involved in the considered task 
can be seen as a model selection problem, by introducing the indicator variable 7 = 
(7i,...,7Ar) G {0,1}, such that for all region j: 7^ = means that r/j = 0, and 
7j = 1 means that rjj 7^ 0. Each of the 2 possible values of the indicator variable 7 
then defines a different generative model 9Jt-y describing the data y, corresponding to a 
different subset of explanatory variables r]^ = (r]j)j^^^=i. In the following, we will refer to 
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7 as the functional network variable, since it defines the network of regions functionahy 
engaged in the task under study. 

According to these notations, our task consists in selecting the 'best' (in a certain sense) 
model dJl-y to describe the data y, i.e., the best segmentation of the regions into involved 
and inactive. Equivalently, it can be formulated as an estimation problem, with 7 as 
the interest parameter. The Bayesian approach to this problem consists in considering 
the unknown quantities 7 and = [r] , u'^ , cr'^ , a^) as random variables, and defining a 
prior density 7r(0|7)7r(7), representing the prior knowledge on their possible values. 

Note that, from a frequentist viewpoint, limiting to the variables cited above makes 
sense, since these are the unknown, but fixed, quantities, we wish to infer, as opposed to 
z = (x, w,/!), which are unknown random variables considered as nuisance factors, and 
marginalized out during the inference. In the Bayesian setting we have adopted on the 
other hand, this opposition has no real justification, since both 6 and z are vectors of 
hidden random variables, whose posterior distribution we wish to determine. However, 
we will see that these notations are convenient in the context of the model selection 
algorithm proposed in Section 5.5.1. In particular, the expression of the complete density 
f{y,z\0) is that of a curved exponential model, allowing the evaluation of the MAP 
estimates of 9 using an efficient MCMC-SAEM algorithm, as described in Appendix E. 

Based on the prior 7r(0|7)7r(7) and on the likelihood function f{y\0) of the full hier- 
archical model specified by (5.5), (4.3) and (4.4), the posterior distribution of 6 under 
each model OJI-y can be computed according to Bayes' theorem: 






where 



m(y|7) = I f{y\e)7T{eh)dO (5.8) 

is the marginal likelihood, or evidence, of model OJI-^. This quantity is central to Bayesian 
model selection, since the posterior distribution of the indicator variable can be expressed 
as: 
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-rrf^UA m y7 TT 7) 

Z^-y"i(y7 vr 7 



Assuming that ?n.(y|7) can be computed for all possible values of 7, the most probable 
functional network given the data can be selected, according to: 



7 = argmax7r(7|y), 
7 

= argmaxm(y|7)7r(7). 
7 



Besides being intuitive, this procedure has a number of attractive features. For instance, 
because 7 takes only discrete values, its MAP estimate 7 is also the Bayesian estimator 
associated to the — 1 loss function ^(7,7) = X^,- 17 7^7 ) i-s., it minimizes over all 
possible networks 7 the Bayesian risk: 



^[^(7,7)|y] = Yl ^(7,7)^(7|y)- (5.10) 

76{0,l}^ 

From a multiple testing perspective, this means that 7 minimizes the a posteriori ex- 
pected sum of type I (false positive) and type II errors (false negative) errors. Other 
loss functions could be used, to give different weights to false positive and false negative 
risks. Thus our Bayesian decision-theoretic framework answers one of the limitations of 
the SPM-like approach, which only controls type I risks. 



5.4 Prior specification 

We now address the choice of a prior distribution for the functional network 7, and the 
model parameters 6. Under limited information about their possible values, it is natural 
to use a weak prior. This choice is problematic in generalized linear mixed models 
(GLMMs), because Jeffreys prior may be intractable, and the standard scale invariant 
(improper) prior may lead to an improper posterior [Hobert and Casella, 1996]. Instead, 
we have adopted normal priors on the effects and inverse-Gamma priors for variance 
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components. These are practical choices, their conditional conjugate form making them 
amenable to posterior samphng. Our prior is specified as follows: 

5.4.1 Regional means 

For each region j, we define: 



VjW],lj = ~ '^(O) (5.11) 

VjWj,lj = '^ ~ AA(m,i/|/A), 

where 5{0) is the Dirac mass in 0, meaning that in inactive regions, the regional mean 
vanishes almost surely; in the other regions, the prior mean is set to ttt. = so as not 
to bias the inference towards positive or negative effects and, the scale parameter is set 
to A = 10^'^, which may be interpreted as the weight given to the prior mean m with 
respect to one observation. 

5.4.2 Variance components 

All variance components share the same prior: 

N 

7r{alu^,a^) = ig{ala, ^)ll{ig{u];a, (3)ig{a];a, (3)} (5.12) 

where ig{a,P) is the Inverse-Gamma distribution with parameters (a,/3), and density 

function 

ig{ z-a,P) = J^.z-'^-' exp (^) . 
T{a) \ z J 

Tuning a, and /3 is a difficult task in absence of prior information. A popular choice is 
to use a "just" proper prior in absence of prior knowledge, given for instance by a = 
f3 = 10~^ [Spiegelhalter et al., 1996]. This common practice has raised some concerns 
[Natarajan and McCulloch, 1998], because they can result in poorly behaved posterior 
sampling schemes, due to the near impropriety of the resulting posterior. 
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Thus we chose values which reflected the hmited amount of knowledge available on the 
interest parameters, while defining truly proper priors, setting a = 3, /3 = 20. This means 
for instance that in each region j, the population variance a'j has a prior mean of 10, 
and a prior variance of 10^, which seems reasonable from practical experience. The same 
considerations hold for the regional variances i^?, and the spatial variance parameter (t|. 

Several alternative priors are proposed in [Natarajan and Kass, 2000], including a uni- 
form shrinkage prior, and an approximate Jeffreys prior. However, we found the results 
obtained by the standard priors satisfying enough not to consider these more complex 
strategies. 

5.4.3 Indicator variables 

We define independent Bernoulli priors for the indicators variables: 



TV 

vr(7) = l{B{jf,l,pj), (5.13) 



meaning that region j has a prior probability of pj of being activated. In the following, 
we use the default choice pj = 0.5, but prior information on the state of regions can 
easily be included at this stage, such as results of previous analyses. 

5.5 Evaluating the marginal likelihood 

We now address the computation of the marginal likelihood m(y\^), defined by (5.8). 
As is often the case, this is a difficult task, because it requires integrating the probability 
density function (pdf) f{y\6), defined on a high dimensional space, with respect to the 
prior density 7r(0\y), which cannot be done analytically in our case. 

Thus, we turn to numerical approximation strategies. The main challenge in evaluating 
m(y|7) is that the integration must be done with respect to the prior density 7r(0\^), 
but the values significantly contributing to the integral are concentrated around the high 
density points of the posterior 7r(0|y, 7). We start by briefly reviewing some of the main 
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strategies that have been developed to deal with this classical problem, and discuss their 
relevance to the present case. 

Importance sampling 

A naive way of estimating m{y\y), justified by (5.8), would be to generate a sample 

idg)i<g<G from the prior distribution 7r(0|7), and deduce the following Monte-Carlo 

estimate: 

G 

9=1 

As mentioned earlier, the significant values of the likelihood function /(y|0) are concen- 
trated on a small region of the support of 7r(0|')'), so most of the sampled values f{y\Og) 
are likely to be close to zero. Hence, this estimate, though unbiased, would have such a 
high variance that it would be useless. 

A natural alternative to sampling under the prior, known as importance sampling, is to 
use a proposal distribution q{d) instead, noting that: 

Thus, if {dg)i<g<G is a sample from q{6), an unbiased estimate of m(y|7) is given by: 

The likelihood function f{y\0) is unknown in the model with spatial uncertainty, but 
this method could still be applied, replacing 6 by (0,w), since the density conditional 
on the displacement parameters w can be computed explicitly (see Appendix F). 

Thus, rhq{y\'y) is still an unbiased estimator of ?Ti(y|7). Its variance, which depends on 
the proposal, is given by: 

v'm,(y|7) = i / ( ^'>''y^' - ™(y|7))\«')<i9. 

In particular, the variance tends to zero as q{6) gets closer to the posterior density 
7r(0|y,7). Thus, the main difficulty of this approach is to choose q{6) 'close to' 7r(0|y,7) 
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which can indeed be hard when httle is known on the shape of the posterior density, as 
is the case here. 

A generahzation of importance samphng is bridge sampUng [Meng and Wong, 1996], 
based on the following identity: 

f f{y\e)7r{e\-f)h{e)q{e)de 

""^^1^^ Jq{d)h{e)7r{e\y,j)de ' 

valid for any choice of functions h and q. Given a sample {Og)i<g<G from q{0) as before, 
and a sample (^j)i<j<j from the posterior density 7r(0|y,')') (in our case, replacing 
by (0, w), this can be done by by the Metropolis- with Gibbs algorithm in Appendix D), 
an unbiased estimate estimate of 'm{y\'-f) is: 

""''^'"*= J-^^...ie^m) ' 

This method requires the choice of an additional function /i(0), and the algorithm re- 
duces to importance sampling when h[6) = 1. [Meng and Wong, 1996] show that certain 
choices of h can reduce the variance of the classical importance sampling estimate, and 
indicate iterative strategies to choose h. However, the choice of a proposal q remains an 
issue in our case, and it is not clear how this method performs for 'bad choices' (what 
happens for instance if q is chosen equal to the prior?). Thus, it is not clear how impor- 
tance Monte-Carlo sampling strategies could be applied in a simple way to the present 
problem. 

Harmonic mean estimator 

The harmonic mean estimator (HME) in [Raftery et al., 2007] constitutes a very simple 
strategy to estimate the marginal likelihood from the output of any posterior sampling 
scheme, based on the harmonic mean identity: 
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Thus, given a sample (^j)i<j<j from the posterior density, an unbiased estimate of the 
marginal likelihood is given by: 

-1 
/ .^ 1 \ 

mHMiyh) 



y^J(y\o,>) ' 



As with importance sampling, this method could easily be applied in our case, replacing 6 
by (0, w), and sampling their joint posterior density using the sampling scheme described 
in Appendix D. 

In spite of its appealing simplicity, the HME is known for its lack of numerical stability, as 
measured by the variance: Var[/(y|0)^-'^|y], making its applicability hazardous. Indeed, 
this variance is determined by the second moment: 



1 [Ti{e\-t) 



m{y\i)J f{y\e) 



de. (5.14) 



For the above integral to be finite, the prior distribution ti{0\^) must have lighter tails 
than the density f{y\0) (viewing the latter as a function of 6). Intuitively, this means 
that the HME is numerically stable when the prior is more sharply peaked, i.e., provides 
more information on the parameter of interest, than the data! [Raftery et al., 2007] 
show for instance that this is not the case in the simple Gaussian model, with the 
usual conjugate inverse Gamma-Gaussian prior distribution on the Gaussian mean and 
variance (as defined in Section 5.4). 

[Raftery et al., 2007] indicate possible ways of stabilizing rhHM{y\^)^ which consist in 
replacing f{y\6j) by a marginalized version /(y|/i(0j)), for an arbitrary function h. 
It is shown that in some cases the modified estimator has finite variance. However, 
the choice of h is strongly model-dependent, and we have found in our case no such 
convenient function ensuring finite variance for the HME. 

Variational Bayes 

The variational Bayes (VB) framework [Beal, 2003] can be used to compute a lower 
bound on the marginal likelihood. This bound originates in an identity similar to that 
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upon which the importance sanipHng approach is based, 



m(y|7)= fq{z,e) ^^^f'y^ dzde, 



vaHd for any proposal distribution q{z,6), with z = (x, /x,w). Taking the logarithm 
on both sides, and applying Jensen's inequality yields (thanks to the concavity of the 
logarithm) 

logm(y|7) > q{z,e)log J-^-r — dzd9. 

J q{z,0) 

Thus we obtain a lower bound on the marginal likelihood, noted J--y{q) in the following. 
This inequality becomes an equality for g(z, 6) = 7r(z, 6\y, 7), but of course this proposal 
cannot be used directly, because its normalizing constant is unknown, since it is precisely 
the marginal likelihood we wish to compute. 

The goal of VB approaches is to maximize J~-y{q) with respect to q, restricting the 
latter to a class C of functions such that the maximization may be performed efficiently. 
Typically, C is defined as a set of functions which are factored across several blocks of 
latent variables, for instance C = {q\q{z,6) = qz,{z) qo{d)}. In this case, F-^^q) can be 
maximized using the VBEM algorithm, which alternates maximizations over (7z(z) and 
qe{9), until convergence to a local maximum. It can be shown that iteration t of the 
algorithm is given by 



qi, '{z) oc exp 



g(*+^V) « exp 



q^e\0)logfiy,z,eh)d9 
I (?(*+') (z) log /(y,z,0|7)ci: 



(5.15) 
(5.16) 



Several problems arise at this point. Even if the integral in (5.15) can be evaluated 
explicitly, their is little chance that the resulting expression of qz (z) corresponds to a 
known distribution in the model with spatial uncertainty. Further factorizing q^ (z) 
does not resolve this issue, because it is due to the complicated relation between spatial 
displacements and the other variables, as explained in Appendix D. Because of that, the 
integral in (5.16) has no closed form, making the determination of q^ (6) problematic. 

Thus, using a variational Bayes bound to approximate the marginal likelihood in the 
model with spatial uncertainty would most probably either require further approxima- 
tions, or hybrid strategies combining MCMC and VB approaches [Forbes and Fort, 2007]. 
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Though we do not dismiss the use of variational Bayes, it does not seem clear for the 
moment what advantage it would have over the other approaches, or how it could be 
implemented efficiently in our case. 

Reversible jump MCMC 

Another alternative would be to use the reversible jump MCMC (RJ-MCMC) algorithm 
[Green, 1995], which consists in sampling 7 along with the other parameters. This 
approach allows to 'jump' from one model specified by 7 to another, effectively changing 
the size of the parameter vector from one iteration to another. Thus only the models 
most relevant given the data are visited, rather than all the models, and a single run 
of the RJ-MCMC algorithm is necessary for the whole model selection procedure. The 
marginal are not computed directly in this approach; instead, the posterior probabilities 
of each model Tl~f is estimated directly by the proportion of total iterations spent in 
dJt-f by the sampling algorithm. 

Though attractive, this technology has several drawbacks. [Chib and Jeliazkov, 2001] 
indicate that the algorithm can be quite complicated to tune in order to promote mix- 
ing across spaces of varying dimension. Also, each new model introduced in the sampling 
scheme must include a subset of the existing models, which artificially increases the pa- 
rameter space. Finally, as noted in [Marin and Robert, 2007], the parameters of interest 
within each model (and the posterior probabilities 7r(7|y)) may be poorly estimated if 
the algorithm keeps jumping from one model to another. Hence, it is often necessary to 
use hybrid strategies, which alternate reversible jump and classical MCMC iterations. 
Finally, for each model to be well estimated, it seems that the RJ-MCMC would unavoid- 
ably require at least as many iterations as the sum of iterations needed to fit each model 
using more conventional MCMC techniques. Thus, it seems to us that the RJ-MCMC 
algorithm is less appropriate for model selection than for model averaging, and in partic- 
ular to obtain estimates of the parameter 7 which take into account the uncertainty on 
the choice of a model. An example of such an application in the context of fMRI group 
data analysis is given in [Xu et al., 2009], which is presented in Section 2.6.4. However, 
this is not our primary goal here. 



A Bayesian model selection approach to the detection of functional networks 109 

5.5.1 Chib's approach 

We have finally opted for Chib's method [Chib, 1995, Chib and Jeliazkov, 2001], which 
allows to compute the marginal likelihood from the output of virtually any posterior 
sampling scheme. It is efficient, simple and applicable in most cases. This approach is 
based on the basic marginal identity (BMI): 

which simply expresses the fact that the marginal likelihood is the normalizing constant 
of the posterior distribution. It is valid for any particular value 0* of the parameter, 
but it is advised to choose a high density point, such as the posterior mean or posterior 
maximum (MAP), where the different densities are more likely to be well estimated 
than in tails. In our model the MAP cannot be computed analytically, but can be 
numerically evaluated using the MCMC-SAEM algorithm detailed in Appendix E, while 
the posterior mean can be obtained from the Metropolis within Gibbs (MH-Gibbs) 
algorithm described in Appendix D. 

Computationally, the main advantage of (5.17) is that it expresses the marginal likeli- 
hood m{y\'y) in terms of the posterior density 7r(0*|y, 7), rather than the prior density, 
providing a way to use posterior sampling strategies, such as the Gibbs sampler, to 
estimate w.(y|7). Indeed, the posterior density can be written as 

<0*\yn) = /"7r(6>*|z,y,7)7r(z|y,7)dz, (5.18) 

where z = (x,w,/2). Thus, given a sample {6i,zi, . . . ,Oj,zj) of the joint posterior 
density 7r(0,z|y,7), obtained e.g. by a Gibbs sampler, the posterior density can be 
computed by Rao-Blackwellisation [Gelfand and Smith, 1990]: 

1 ^ 

HO*\yn) = j^^(r|z,-,y,7), (5.19) 

i=i 

since the Zj's are asymptotically drawn from the marginal 7r(z|y,7). The obtained esti- 
mate is simulation consistent, i.e. it converges to the true value when J — t- 00. 

This approach works nicely in the special case of no spatial uncertainty {dg = 0, w = 0), 
because then the likelihood function /(y|0*) is available in closed form (simply apply 
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the formulas in Appendix F with w = 0). Furthermore, since in this case the regions are 
independent, as noted previously in Section 5.2, only 2N models need to be estimated 
instead of 2 . 

5.5.2 Likelihood under spatial uncertainty 

Unfortunately, this appealingly simple method cannot be directly applied to the model 
with spatial uncertainty, because the conditional density f{y\w,0*) only is available in 
closed form, but not the likelihood function: 



/(y|^*) = J fiy\w,e*)7T{w\0*)dw. (5.20) 

The difficulty in calculating (5.20) is essentially the same as in calculating the marginal 
likelihood: it is an integral on a high dimension space, with respect to the prior distribu- 
tion tt{w\6*), whereas significantly contributing values of the integrand are concentrated 
around the modes of 7r(w|0*,y). Therefore, we may apply Chib's method again, by writ- 
ing the following alternative identity: 



f( \fi*^ /(y|w^r)7^(w*|r) 

^^^'^^ = vr(w*|r,y) ' ('-'^^ 



valid for any value w* of the elementary displacements. A high density value of the 
posterior ordinate 7r(w|0*,y) is however advisable, for accurate numerical evaluation 
of (5.21). Thus a reasonable choice is the conditional maximum a posterior w^^^p = 
argmaxw vr(w|0*,y), which can be evaluated using the simulated annealing (SA) algo- 
rithm, as described in Appendix G. A simpler alternative, requiring less computations, 
is the posterior mean w = E[w|y], directly available from the output of the MH-Gibbs 
sampler in Appendix D as the average of sampled w values. However, it may be distant 
from the principal mode of 7r(w*|0*,y), and result in a less stable estimator of (5.21). 
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Having chosen w*, both /(y|w*,0*) and it{w*\6*) are available in closed form, but 
not the posterior ordinate 7r('w*|0*,y). This difficulty cannot be solved using the ap- 
proach in [Chib, 1995], which assumes that the hidden variables can be decomposed into 
blocks with conditional densities available in closed form. In the present case, w* can be 
decomposed into blocks consisting of single elementary displacements Wj^, but the con- 
ditional density of each block has no analytical expression, and must be sampled using a 
Metropolis-Hastings step, as described in Appendix G. A generalization of the method 
in [Chib, 1995] to this setting is developed in [Chib and Jeliazkov, 2001]. It consists in 
factorizing the posterior ordinate across blocks, according to: 

n B 



vr(w*|0*,y) = \{Y[7^{^l\^UbX,y) 



i=lb=l 

where we write w_j6 to denote the blocks preceding ih in the lexical order, that is, the 
collection of blocks Wj/f,/ where i' < i, b' < b, and (z', b') ^ {i, b). 

Each reduced posterior ordinate 7r(w*j|w^j^, 0*,y) can then be evaluated from the out- 
put of a reduced run of the multiple block MH algorithm, conditional on w_jfc, and 
an additional run where Wj(, is added to the conditioning set. Justification and further 
details on the calculation of the reduced ordinates and the likelihood function are given 
in Appendix H. 

5.6 Comparing different parcellations 

The choice of a particular parcellation is an important issue, as our whole decision 
framework rests upon it. Yet, the definition of functionally homogeneous regions in the 
human brain remains an open issue. In practice, mis-specified parcellations may cause 
activated areas to cross several parcels, resulting in reduced sensitivity, and difficulty in 
interpreting the detected pattern. 

Though resolving this issue is beyond the scope of this work, we remark that the same 
Bayesian model selection formalism used to select the functional network 7 can be used 
to compare different candidate parcellations. This is done by considering the parcellation 
as a random variable to be estimated, rather than a fixed quantity. Thus, two given 
parcellations ^ = {Vi, . . . , Vn} and ^' = {V[, . . . , V^} may be compared through their 



A Bayesian model selection approach to the detection of functional networks 112 

posterior odds: 



7r(^'|y) m(y|^') tt{^')' ^ ' ' 

where 7r(=^) is the prior probability assigned to parcellation ^, and m{y\^) the evi- 
dence for ^, given by: 

m(y|^) = Y^ m{y\j,^)TT{-f\^). (5.23) 

7e{o,i}^ 

Here 7 is the vector indicating possible functional networks, based on parcellation ^, 
"i(y|7) ^) is the marginal likelihood defined by (5.8) and 7r(7|=^) the prior probability 
of network 7, defined by (5.13). 

Thus, selecting the most adequate parcellation among a set of candidates, to infer the 
functional pattern associated with a certain functional task, is a straightforward applica- 
tion of the variable selection framework introduced in the previous section. Furthermore, 
applied to anatomically defined regions, it provides a tool to investigate the link between 
brain anatomy and function. 



5.7 2D toy example 

We now illustrate our model selection approach on a simulated dataset. In Section 4.5.2, 
our goal was to evidence a stretching effect of the activations due to spatial variability, 
and its compensation through appropriate modeling. Going one step further in our anal- 
ysis, we now show that this stretching effect may result in a bias toward false positives 
when testing the presence of activations within pre-defined regions. Furthermore, we 
show that this bias can be corrected when spatial uncertainty is accounted for. 

We defined a synthetic activation pattern, within a 2D search volume of 24 x 24 voxels, 
consisting of a central activated disc, with uniform intensity value 5 (the background 
was set to 0) and a diameter of 7 voxels. 

To simulate the data, this activation was deformed according to a displacement field u, 
simulated under the model described in Section 4.3, with one control point in each voxel. 
The standard displacement was taken equal to 175 = 1.0 voxels and the field smoothness 



A Bayesian model selection approach to the detection of functional networks 



113 



parameter was set to a; = 4.0 voxels. Independent heteroscedastic Gaussian noise was 
then added to each voxel v, with variance equal to 1 + s^(v), where s^(v)/e ~ X^(l); 
e being the noise level, set to 1.0 in this example. A total of n = 30 pairs (yj,s?) of 
effect and variance maps were sampled in this fashion, as illustrated in Figure 5.2, and 
constitute a sample from the hierarchical model in Section 4.2 . 
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Figure 5.2: Three different simulated effect effect maps y^ 



Data analysis 



The search volume was divided in two regions, corresponding to the the background 
(Vi) and the active disc (V2). In this elementary situation, only 4 activation models 
are in competition, each of them corresponding to a value of the indicator variable 7 G 
{0, 1}^. We applied the Bayesian model selection approach described in 5.3 to recover the 
parcels with a nonzero mean population effect, both with and without modeling spatial 
uncertainty. In the model with spatial uncertainty, deformation fields were specified 
using a single control point in the center of the search volume, and the same regularity 
parameter a; = 4 used to simulate the data. 

More precisely, the marginal likelihood in each model was estimated as explained in 
Section 5.5. Thus, the posterior density of the model parameters was maximized over 
500 iterations of the MCMC-SAEM algorithm (see Appendix E), following 500 'burn-in' 
iterations. Next, the posterior density of the resulting MAP estimate was computed 
using 1000 Gibbs iterations, following a burn-in period of 100 iterations. This was 
sufficient to obtain the marginal likelihood in the model without spatial uncertainty, 
applying (5.17). 

In the model with spatial uncertainty, the density of the deformation fields, conditional 
on the MAP estimate of model parameters, was maximized using 500 iterations of the SA 
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algorithm (see Appendix G). Finally, this density was computed as in section 5.5.2. As 
explained in Appendix H, each reduced run was sampled using a number of iterations 
inversely proportional to the number of sampled blocks, and used 3000 iterations for 
the single-block run. Computing the marginal likelihood of each model under spatial 
uncertainty took approximately 9mn on a PC with a clock rate of 1.33GHz, against 
20s seconds without spatial uncertainty. To quantify the variance of the Monte-Carlo 
estimates of the marginal likelihood values, we repeated all calculations 10 times on the 
same dataset. 

Results 

Presence of the stretching effect can be checked in Figure 5.3, where posterior estimates 
of the signal from one trial are shown. These were computed by averaging over the four 
models the posterior mean conditional on the MAP parameter estimates 0-y, according 
to: fj, = X^-y IE[/x|y, ^-y]7r(7|y) (using the posterior mean E[/x|y,7] would have made 
more sense, but was not directly available from the output of our algorithm). 
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Figure 5.3: Posterior estimates oi fi (left), with (right) and without (center) modeling 

spatial uncertainty. 



The mean and standard deviate over 10 trials of the marginal likelihood values of each 
model, are given in Table 5.1. As expected, when spatial uncertainty is unaccounted 
for, the maximum marginal likelihood is attained for 7 = (1, 1), that is, activations are 
detected in the background, due to the stretching of the estimated activate disc. These 
values are numerically stable, as can be checked from their standard deviates. 

The stretching effect is compensated in the model without spatial uncertainty, where 
the maximum marginal likelihood is achieved on the average for the correct model 7 = 
(0, 1). Moreover, the marginal likelihood values are much higher, indicating the better fit 
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7 


m{y\j,as = 0) 


m(y|7) 


(0,0) 


-87882.68 ±0.12 


-33655.6 ±21.6 


(1,0) 


-87868.39 ±0.17 


-33663.9 ±12.7 


(0,1) 


-87748.58 ±0.10 


-33644.0 ±13.2 


(1,1) 


-87733.29 ±0.18 


-33651.3 ±16.1 



Table 5.1: Log marginal likelihood values computed on 2D simulated data, over 
10 trials. Results are given in the form: mean ± std. deviate. 



obtained by modeling spatial uncertainty. However, the differences between the marginal 
likelihoods of the different models are now swamped in the variability of the Monte-Carlo 
estimates, so that the procedure does not systematically select the correct model on each 
trial. 

This variability comes from the difficulties encountered in sampling the conditional den- 
sity of the elementary displacements, as described in D. As a result, the Markov Chain is 
very sluggish in its exploration of the space of all possible deformation fields, and tends 
to get trapped in local maxima, an issue already noted in Sections 4.5 and 4.6. Because 
on each trial, the Markov chain gets trapped around a different mode, the resulting esti- 
mates of the marginal likelihood are highly variable. Increasing the number of iterations 
did not solve this issue, though in theory it would allow after sufficient time the chain 
to escape from each local mode and explore exhaustively the sampling space. 

In conclusion, this numerical experiment demonstrates the potential benefits of modeling 
spatial uncertainty when testing regional hypotheses on fMRI data, using the model 
selection approach developed in Section 5.3. However, this approach turns out to be 
numerically unstable when modeling spatial uncertainty, even on the overly simplified 
toy dataset used here, with 2D data simulated under a high signal to noise ratio, and 
using smooth deformation fields with known smoothness. 

5.8 Approximate inference using posterior modes. 

To address the above mentioned numerical instability issue, we consider approximating 
the marginal likelihood m{y\^) by the following likelihood, conditional on a particular 
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value w, which is the same for all networks 7 : 

m(y|w,7) = y"/(y|w,0)7r(0|7). (5.24) 

By doing this, we avoid the difficult integration with respect to the displacements. This 
may be interpreted in terms of the Laplace approximation (see [Robert, 2007] for in- 
stance), which consists in approximating the unnormalized posterior density 7Ti(y|w, 7)'7r(w|7) 
of w by an unnormalized Gaussian, obtained from a second order Taylor expansion of 
logm,(y|w,7)7r('w|7) around its mode w. 

Using this approximation would be overly expensive, because the posterior mode w 
would need to be computed for each 2^ possible values of 7, an unfeasible task for 
real datasets, since each model would need at least several minutes to be processed. 
Moreover, the validity of a Taylor expansion of logm(y|w,7)7r(w|7) is questionable in 
our case, since it is only piecewise continuous as a function of w. 

Hence, we simplify the Laplace approximation, replacing ?n,(y|w,7)7r('w|7) by a Dirac 
mass in the posterior mode w instead of a Gaussian, and further assuming w to be the 
same for all networks 7. 

We define w as the following conditional posterior mode in the model without parcella- 
tion, defined in Chapter 4: 



= argmax7r(0|y); (5.25) 



w = argmax7r(w|y, 0). (5.26) 



We use the mode conditional on the MAP parameters 0, rather than the unconditional 
mode, for simplicity, since it can be estimated using the SAEM and SA algorithm pre- 
viously introduced (see Appendices E and D). 

The conditional likelihood (5.24) can be evaluated exactly as the marginal likelihood 
under no spatial uncertainty. Indeed, because voxels are independent conditional on w, 
this expression can be factorized across regions: 

N 

m{y\\v,-f) = J|m(y^|w,7j). 
i=i 
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where y^ = {yi.fc; v^ + Uj^^ G Vj} is the subset of observations corresponding to voxels 
displaced into region j. Note that this set is random, since it is a function of Uj^^, itself 
a function of the elementary displacements w, as defined by (4.3). The conditional 
likelihood of region j is equal to: 



"T'(y^|w,7i) = / /(y^|w,0j)7r(0j|7j)d0j, 



where Oj = {r]j, i/^, a^) is the parameter vector for region j. 

These conditional likelihoods may be evaluated separately, or equivalently, (5.24) can 
be computed for 7 = Oat and 7 = Iat. This can be done by Chib's method (see Sec- 
tion 5.5.1), writing: 

. I-- ^ /(y|w,0^)7r(0^|7) 

"i(y|w,7) = ,;, , . k — ' (5-27) 

7r(6>-),|y,w,7) 

where we choose 6~f as the conditional MAP estimate: 6-y = argmax^ 7r(0|y, w,7). The 
posterior density 7r(0-y |y, w, 7) is obtained by the Rao-Blackwell method, as explained in 
Section 5.5.1, while the exact expression of the density /(y-'lw, 9) is given in Appendix F. 

Finally, the posterior density vr(7|y) of the functional network is approximated by 
■''"(tIYj w), which can be factorized across regions into: 

vr(7|w,y) = n^(^ily'»- (^.28) 

This posterior density is entirely determined by the posterior probabilities that each 
region is involved, 

Pj = ^hj = l|y^ w) = 1 - 7r(7j- = Oly?, w), 

obtained from the conditional likelihoods as 

_ / m(y^|w,7^. =0) ,r(7i=0) \"' 
' ~ V +m(y^|w,7, = l) vr(7, = l)J ' ^'"''^ 

Intuitively, conditioning on w may be seen as a pre-processing step wherein the func- 
tional images are registered to a template //, which is estimated at the same time. Thus, 
it provides a way to compensate for spatial normalization errors based on the functional 
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data. However, because the posterior variability of the elementary displacements around 
their mode is neglected, we anticipate good results only when this variability is reduced, 
that is when the data provides enough information for the elementary displacements to 
be well estimated. 

5.8.1 2D toy example 

We applied the above posterior mode approximation to the same dataset used in Sec- 
tion 5.7 . Our goal was to validate the approximation, and evaluate its numerical 
stability. 

Data analysis 

We used our posterior mode approximation to compute the posterior probabilities Pj of 
positive mean activations within each region j = 1,2. More precisely, the MAP estimate 
9 of model parameters was first obtained without parcelling the search volume, using 
500 iterations of the SAEM algorithm, following 500 burn-in iterations. Then, the con- 
ditional MAP estimate w of the displacement fields was computed using 500 iterations 
of the SA algorithm. The conditional likelihoods m{y\w,^) were then computed both 
under the null model, defined by 7^ = 0, and under the full model, defined by 7^ = 1. 
In each case, the conditional MAP estimate 6~f of model parameters was obtained from 
1 000 iterations of the SAEM algorithm (including 500 burn-in iterations) , and the con- 
ditional posterior 7r(0-y|y, w,'/) was computed from the output of 1000 iterations of 
a MH-Gibbs algorithm, following a burn-in of 100. We then computed the log Bayes 
factors: 

m(yJ|w,7j = 0) 
from which the Pj were directly deduced as 

following (5.29), adopting a uniform prior vr(7j = 0) = '7r(7j = 1) for the state of each 
region. We compared these results with those obtained under no spatial uncertainty. 
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setting w = and a^ = 0, noting that, in this case, the procedure is exact and equiv- 
alent to the one tested in Section 5.7. All computations were re-run 10 times to assess 
numerical stability. 



Results 

Presence of the stretching effect can be checked in Figure 5.4, where posterior estimates 
of the signal from one trial are shown. As previously, these were computed by averaging 
over the four models the posterior mean conditional on the MAP parameter estimates 
and the MAP elementary displacements, according to: /i = ^ E[/x|y,0-y,w]7r(7|y, w). 
Also, it can be seen that the posterior mode w provides in the present case a good 
estimate of the unknown displacements. Indeed, the posterior estimate of /x conditional 
on w has a mean square error (MSE) of 0.11, almost as low as that of the unconditional 
estimate tested in Section 5.7, which was equal to 0.08. 
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Figure 5.4: Posterior estimates of// (left), with (right) and without (center) modchng 
spatial uncertainty, using the posterior mode approximation 





Spatial uncertainty 


No spatial uncertainty 


Region j 


B, 


B, 


1 (background) 
2 (disc) 


-7.16 ±0.27 
3.21 ±0.75 


29.23 ±0.48 
51.89 ±0.01 



Table 5.2: Results of the approximate model selection procedure on a single 2D 

simulated dataset, over 10 trials. 
Results are given in the form: mean ± std. deviate. 
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The mean and standard deviates of the Bayes factor for each region is given in Table 5.2. 
Results for the model without spatial uncertainty are identical to those found in the 
previous experiment, since the procedures are in this case equivalent. They are also 
quantitatively similar in the model with spatial uncertainty: the background is correctly 
classified as inactive, with a negative log Bayes factor Bi, and the disc as active, with 
B2 > 0. Furthermore, the variance of the log Bayes factors is much reduced with respect 
to that of the log marginal likelihoods (see Table 5.1), showing that our approximation 
provides better numerical stability. 

In conclusion, this experiment shows that our posterior mode approximation works well 
in a favorable setting, were it leads to the same conclusions as the exact procedure, with 
a much reduced numerical variability. These good results can be attributed to the fact 
that the data is simulated with a high signal to noise ratio and warped using smooth 
deformation fields with known regularity, so that the posterior density of w is likely to 
be sharply peaked around its mode. 

5.8.2 3D toy example 
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Figure 5.5: Posterior estimates of fj, (left), with (right) and without (center) modeling 
spatial uncertainty, using the posterior mode approximation 



The previous results are encouraging, but does our posterior mode approximation work 
as well on 3D data? To answer this, we used a dataset simulated exactly as in Section 5.7, 
except that the 2D 24 x 24 search volume was replaced by a 3D 24 x 24 x 24 search 
volume, and the central disc was replaced by a sphere with same diameter (7 voxels). 

We used our posterior mode approximation to compute the Bayes factors for both regions 
(background and sphere), using the algorithm detailed in Section 5.8.1. We compared 
these results with those obtained under no spatial uncertainty, setting w = and (t| = 0, 
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Spatial uncertainty 


No spatial uncertainty 


Region j 


B, 


Bj 


1 (background) 
2 (sphere) 


39.55 ±1.2 
264.55 ± 6.82 


80.47 ±0.05 
237.06 ±0.01 



Table 5.3: Results of the approximate model selection procedure on a 3D simulated 

dataset, over 10 trials. 



and repeated all calculations over 10 trials to assess numerical stability. Each trial took 
approximately one hour on a PC laptop with a clock rate of 1.33GHz. Results for the 
3D dataset are given in Table 5.3. As in Section 5.8.1, we report the log Bayes factor 
values Bj for each region j. 

In terms of estimation, the posterior mode approximation again gives satisfying results, 
as seen in Figure 5.5, with a mean-square error (0.03) lower than that of the estimate 
in the model with no spatial uncertainty (0.06). In terms of model selection, results are 
less satisfying, since the background is selected as active, with a positive Bayes factor, 
both with and without spatial uncertainty. 

One possible explanation is that estimating 3D displacement fields requires more infor- 
mation than is provided by the data (which seemed sufficient in the 2D case), so that 
our posterior mode approximation, though it reduces the spreading effect due to the 
mis-localization of individual activations, does not reduce it enough in order for the 
background to be found inactive. Consequently, the Bayes factor conditional on the 
most probable displacements is lower than when displacements are fixed to zero, but 
still positive. 

5.8.3 Additional penalty on model fit 

The above illustrations suggest that approximating the marginal likelihood by fixing the 
displacements to their most probable value works well on 2D data, but on 3D data fails 
to entirely compensate the warping of individual images and the ensuing swelling of the 
estimated activations. Consequently, the systematic bias toward false positives, found 
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when registration errors are not modeled, is reduced but still present when using the 
posterior mode approximation. 

This bias can be seen as a form of data overfit, the number of parameters needed to 
model the data being overestimated, due to likelihood values inflated by displaced activa- 
tions. Note that this bias would be automatically compensated when using the true log 
marginal likelihood log ?Ti(y I7) rather than the conditional approximation log rn-(y| w, 7), 
because it would contain an additional term penalizing the overfit due to modeling spa- 
tial displacements: 

vrfw) 

logm(y|7) =logm(y|w,7) + log ^ ^. 

7r(w|y,7) 

However, the previous example has illustrated the fact that computing this penalizing 
term requires heavy MCMC calculations (specifically, the posterior ordinate vr(w|y,7)), 
and in our case resulted in a far too important numerical variability to be of ay use. 
As a surrogate to this currently unatainable exact quantity, we consider compensating 
the bias by modifying Bj, adding a penalty to model fit, measured by the log likelihood 
ratio 

Tu 1 /(y-'|w,^ii) .^ „„s 

LRj = log : ^^^, (5.30) 

/(yJ|w,0oi) 

where O^j = argmax^, TT{6j\'yj = k)f{y^\w,Oj) for k = 0,1. This quantity is available 
as an output of the algorithm which computes the log Bayes factor Bj, since, following 
(5.27) it can be written as: 



m(y-'|w,7j 
Bj = log 



m(yJ|w,7j = 0) 
1 /(y^|w,0ij) 7r(0ij|7j = 1) 7r(0oi|y^w,7j = 0) 

= log ^^^ + log TT^ + log TT^ 

/(yJ|w,0oj) 7r(0oj|7i = 0) 7r(0ij|yJ, w,7j = 1) 

LRj + Dj. 

(5.31) 

We expect LRj to be always positive. Indeed, the MAP estimates O^j, used to define it 
in (5.30) are likely to be very close to the maximum likelihood (ML) estimates, defined 
by 6j^j = argmaxg. f{y^\w,6j), given that we have chosen a prior tt{0\jj) that is as 
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minimally informative as possible (see Section 5.4). Hence, it is reasonable to expect 
LRj to be close to the log maximum likelihood ratio, a quantity that is always positive 
for nested models. Corroborating this, we have always observed in practice positive 
values for LRj . Hence we have not felt the need to use the ML estimates instead of the 
MAP estimates of the parameter values, though this would be a possible alternative. 

Using the positivity of LRj, we define a lower bound on the Bayes factor Bj by: 



Bj = c X LRj + Dj , 



(5.32) 



for a certain scale factor c G (0, 1), which determines the added penalty. Based on 
this quantity, we define the following lower bound on the posterior probability Pj of a 
nonzero mean activation within region j : 



P, 



Alj = 0) 

1 ^ — ; r 

^ 7r(7,- = 1^ 



1-BiX -' '^ 



ij 



(5.33) 



Pj underestimates the probability of each region being involved in the task at hand. 
Thus, the bias it introduces is always conservative. 



Calibration of the additional penalty on simulated data 
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Figure 5.6: Optimization of the additional penalty on simulated datasets, for the 

posterior mode approximation using the model with spatial uncertainty (left), and using 

the model with no spatial uncertainty (right). 
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Ideally, we would like to choose the factor c small enough so that it will compensate for 
data overfit, but not too small, for active regions not to be missed. We perform this 
calibration using intensive calculations, by estimating the optimal value from a collection 
of datasets simulated under different parameter values. 

As in Section 5.8.2, we defined a synthetic activation pattern, within a search volume 
of 24 X 24 X 24 voxels, consisting of a central activated sphere, with uniform intensity 
value 5 (the background was set to 0) and a diameter of 7 voxels. 

The data was then simulated using a field smoothness parameter c^j = 3, 4 or 5 and a 
noise level e = 1, 2, 3 of 4. For each of the 3x4 possible combinations of these parameters, 
we simulated 10 different datasets comprising n = 30 images, to which we applied the 
procedure in 5.8 to compute Bayes factor values Bj = LRj + Dj for regions j = 1,2, 
using the same algorithm as in Sections 5.8.1 and 5.8.2. 

Next, for all c = 0, 0.01, 0.02, . . . , 1, we computed the modified criterion Bj{g,c) = 
c X LRj{g) + Dj{g) for all 120 datasets, indexed hy g = 1, . . . , 120 and measured the 
proportion of corresponding mis-classified regions, given by: 

120 



^(•^^ - 2^ \^B^(c,g)>0 + ^B2{c,a)<o\ 



9=1 

Finally, we chose the value of c which minimized R(c). 

As illustrated in Figure 5.6, left, a minimum of 2 misclassified regions (out of 420) was 
obtained for c = 0.05. The surprisingly low number of classification errors obtained 
after optimizing c may suggest that it is the introduction of this factor, rather than the 
posterior mode approximation, that is effective in correcting data overfit. 

To verify this assertion, we also computed for each dataset the Bayes factors Bj using 
the model without spatial uncertainty, computed the modified criterions Bj for different 
values of c and chose the value optimizing the classification rate. As illustrated in 
Figure 5.6, the minimum number of mis-classified regions, obtained in this case for 
c = 0.06, was much higher, and equal to 45. This indicates that the good classification 
score achieved using the model with spatial uncertainty is due to the combination of the 
posterior mode approximation, as well as the additional penalty. 
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5.8.4 Phantom activations 




^.^■^ 
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Figure 5.7: Data simulated using phantom activations. Slice z — 37mm in Talairach 
space. From left to right: Synthetic activation pattern (with CSA atlas in the back- 
ground); CSA atlas (numbers correspond to region index in Table 5.4); simulated data 

example. 



We conclude this chapter by applying our final procedure to a dataset which presents 
some similarities to real-life situations, and in particular, which was not simulated under 
the full hierarchical model (illustrated in Figure 5.1) used to analyze the data. 



Data simulation and analysis 



To start with, we defined a 'life-size' search volume of 45 x 62 x 52 voxels, corresponding 
to the actual dimensions of true fMRI images. Then, we designed an artificial activation 
pattern, based on the cortical sulci atlas (CSA), developed in [Perrot et al., 2008], and 
which we used to analyze real fMRI data (see Chapter 6). Each activation was defined 
in terms of a peak location, from which the signal decreased radially according to a 
Gaussian kernel. Each signal peak was taken equal to 5. We placed two activations in 
neighboring regions, one at the intersection of several regions, and a smaller one inside 
the largest atlas region (see Figure 5.7, region 7). 

n = 40 images were generated by warping this map according to the deformation model 
defined by (4.3), with one control point in each voxel, choosing w = 4 voxels and as = 
2.0 voxels. Homoscedastic noise was then added to each image according to (5.5), with 
a'j = 1, and the s?^'s generated as independent chisquare variables. 

We then applied our Bayesian model selection algorithm based on the posterior mode 
approximation described in Section 5.8 to this dataset, to compute for all regions j = 
1, . . . ,N, the Bayes factor testing the presence of a nonzero regional mean rjj, still using 
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the deformation model defined in (4.3), except this time control points were restricted to 
a grid with regular spacing equal to 7 along each axis. We used the algorithm detailed 
in Section 5.8.1 (except that j = 1, . . . , A^ in our case, with N = 124). 

Finally, lower bounds Pj on the posterior probabilities of nonzero regional means where 
computed, as explained in Section 5.8.3, using an additional penalty controlled by a 
factor c, which we tuned to the optimal value derived in 5.8.3. This penalty was used 
for the Bayes factor computed both in the model with and without spatial uncertainty. 

Results 








Figure 5.8: Statistical maps obtained by the model selection approach on 3D simu- 
lated data. Axial slice z — 37mni in Talairach space. First row contains results for the 
model with spatial uncertainty, bottom row for the model without. From left tor right: 
posterior estimate of the mean effect map /x, the regional mean effect t], and lower 
bounds Pj on the probabilities of a nonzero mean activation, restricted to detected 

regions {Pj > 0.). 



Results from this simulation study are illustrated in Figure 5.7. We estimated the mean 
effect map /i by its conditional posterior mean, averaged over models: 



A = ^HlJ-\y,'^,d'y)n{j\y,w). 
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Pj 


Vj 


Pj 


Vj 


Ai 


Region 


Spatial 


uncertainty 


No spatial uncertainty 




1: 


1.00 


0.53 


1.00 


0.41 


0.56 


2: 


1.00 


0.26 


1.00 


0.19 


0.27 


3: 


1.00 


0.24 


1.00 


0.19 


0.27 


4: 


1.00 


0.25 


1.00 


0.20 


0.26 


5: 


1.00 


0.07 


1.00 


0.04 


0.06 


6: 


1.00 


0.26 


1.00 


0.21 


0.26 


7: 


1.00 


0.04 


1.00 


0.03 


0.05 


8: 


1.00 


0.33 


0.99 


0.23 


0.34 


9: 


0.69 


0.07 


1.00 


0.04 


0.06 


10: 


0.46 


0.04 


1.00 


0.06 


0.05 


11: 


0.00 


0.00 


0.93 


0.02 


0.01 



Table 5.4: Results of the model selection approach on 3D simulated data. Reported 

regions have an approximate posterior probability of being involved in the task greater 

than 0.5. rjj is the posterior estimate of the regional mean effect, /i, is the average over 

region j of the group mean effect map fi. 



where 7r(7|y,w) is noted using a tilde to remind that it is computed using the lower 
bound Pj instead of the posterior probability. It can be seen from Figure 5.7, left, that 
the map obtained using the spatial uncertainty is less noisy, especially in the background, 
indicating the better fit obtained conditional on the most probable displacements a 
posteriori. 

Figure 5.8 shows that both algorithms detected similar regions, but that the model 
with no spatial uncertainty was less conservative. In particular, the latter detected 
activations in region 11, simply because it was located next to region 5 (using the labels 
from Figure 5.7) which contained one of the simulated activation peak. Conversely, this 
region was considered inactive (with Pj = 0.00) using the model with spatial uncertainty. 

This tendency is confirmed in Table 5.4, where we can see that two more regions were 
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detected in the model ignoring spatial displacements. Both methods detected activations 
in region 9, which was contaminated by activations from neighboring regions 8 and 2, 
though the approximate probability is lower in the model with spatial micertainty. 

Finally, we computed the following estimates of the regional means: 

7 

where ry-y is the MAP estimate of rj, conditional on w and 7. Still from table 5.4, it can 
be seen that, as an estimate of the average mean effect within region j, Jx^ = Ylikev l^k, 
f}j is more accurate using spatial uncertainty than not. In fact, the average relative error 
e = -^ Y^- \fjj — fijl/fij was equal to 1% for the spatial uncertainty model against 8% for 
the other. This confirms the fact that our posterior mode approximation did provide a 
better fit to the data than obtained without accounting for displacements. 

5.9 Discussion 

Throughout this chapter, we have investigated the possibility of testing the presence of a 
nonzero mean effect within each region of a pre-defined parcellation of the search volume, 
while accounting for individual images being spatially deformed, according to unknown 
displacement fields. We proposed a Bayesian model selection approach to this problem, 
entailing the computation of the marginal likelihood for each possible observation model, 
specified by a partition of all regions into 'involved' (containing a nonzero mean effect) 
and 'inactive'. 

We have shown how Monte-Carlo estimates of these marginal likelihoods can be ob- 
tained using MCMC techniques. However, this theoretically sound solution fails in 
practice, because the Monte-Carlo estimates of the marginal likelihoods are unstable 
numerically. We then considered an approximate procedure, which consists in assimilat- 
ing the unknown spatial displacements to their most probable values a posteriori. This 
approximation substantially reduces the numerical instability, however we found that 
it re-introduces a certain amount of bias toward false positives, which is systematically 
present when using the model which ignores spatial displacements. We compensated 
this residual bias by introducing an additional penalty, calibrated on a large number of 
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simulated datasets. This last approximation gave satisfying results when validated on a 
final synthetic 3D dataset. 

There is clearly some space for improvement here, in particular concerning the Monte- 
Carlo estimation of the exact marginal likelihood. We have seen that its numerical 
instability stems from the random-walk Metropolis-Hastings step used to sample the 
elementary displacements, which has a high rejection rate, and results in the Markov 
Chain getting stuck in local modes of the posterior density. Thus, a promising direction 
for future work would be to test alternative proposal densities, in view of ameliorating 
our posterior sampling scheme. 

Another alternative we have not yet explored would be to integrate x and /x analytically 
and sample directly from the joint posterior distribution 7r(0,w|y,7), using two altern- 
ing Metropolis-Hastings steps. This would potentially speed-up the convergence of the 
Markov chain, and moreover allow to compute the unconditional MAP estimate of w, 
using simulated annealing. This is required in Section 5.8, and we have so far used a 
conditional MAP estimate as a surrogate. 

Such ameliorations would allow to investigate the behavior of the exact approach in more 
complex situations than provided by the simplistic 2D datasets studied in this chapter. 
This would also provide an additional way of validating our approximate approach, by 
comparing the results obtained by both methods. 

However, we point that the exact approach would nevertheless be associated with a high 
computational complexity, since the number of possible models increases exponentially 
with the number of regions, due to the fact that regions are dependent when modeling 
spatial uncertainty. In contrast, the number of distinct models in the model without 
spatial uncertainty, or under the posterior mode approximation, is linear in the number 
of regions. Thus, the use of approximate techniques, such as the one we developed here, 
seems unavoidable in view of practical applications. 

This raises additional questions concerning the generability of the additional penalty c 
tuning, which we have done here using a colection of simulated datasets. These datasets 
where generated in order to reflect some features of the real datasets we intended to 
analyze (number of observations, inter-subject variance). The resulting penalty was 
found to work well when applied to real data. However, it may very well be that the 
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optimal value for c varies with respect to those parameters we chose to fix during our 
simulation study, and may need to be adapted to different situations, such as a much 
smaller or a much bigger sample size. These questions need to be investigated thoroughly 
in order for the method proposed here to be widely applicable. 



Chapter 6 



Application to real fMRI data 



Abstract 

In this chapter, we apply the Bayesian model selection approach for fMRI group data 
analysis developed in Chapter 5 to a real fMRI dataset. 

To validate our approach, we chose a paradigm based on extensively studied cognitive 
tasks (number and language processing), involving known brain regions. We show that 
our procedure successfully recovers in each case the complete functional network. 

We also compare two different versions of our approach, both with and without modeling 
spatial uncertainty, along with the SPM-like approach, described in Chapter 2. In this 
way, we illustrate the shortcomings of standard voxel-based approaches which rely on 
the thresholding of a statistical map, and how these are overcome by the procedure we 
propose. 

6.1 Data analysis 

The data used here is extracted from the Localizer database [Pinel et al., 2007]. We used 
the same cohort of 38 subjects as in Chapter 3 and refer to Section 3.4 for a detailed 
description. 
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6.1.1 Individual data processing 

Individual data analyses were conducted following the standard pipeline described in Sec- 
tion 2.2, using SPM5 (http://www.fil.ion.ucl.ac.uk/spm/). Data were submitted 
successively to motion correction, slice timing and normalization to the MNI template, 
and spatial smoothing using a 5 x 5 x 5 mm^ FWHM Gaussian filter. For each subject, 
BOLD contrast images were obtained from a fixed-effect analysis on all sessions. 

6.1.2 Methods compared 

For each studied contrast, we used the method developed in Chapter 5 to select the 
functional network most probably involved in the cognitive task under investigation, 
based on a fixed brain parcellation. We used to this end the cortical sulci atlas (CSA) 
developed in [Perrot et al., 2008], derived from the anatomical images of 63 subjects and 
comprising 125 regions which correspond to subdivisions of cortical sulci (see Figure 6.1). 




Figure 6.1: The cortical sulci atlas 



More specifically, we used the posterior mode approximation described in Section 5.8 
to compute the Bayes factor Bj to test the presence of a nonzero mean activation, for 
each region j. The algorithm was tuned as in the simulation study in Section 5.8.2. As 
explained in Section 5.8.3, these Bayes factors where modified through the use of an 
additional penalty, according to (5.32). The factor calibrating this penalty was taken 
equal to c = 0.05, which was found optimal from the simulation study in Section 5.8.3. 
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We compared the results obtained in the model with and without spatial uncertainty, the 
latter corresponding to the special case where a^ = 0, and the elementary displacements 
are frozen to zero. 

We also included the results of the SPM-like approach, described in Section 2.3. In 
this case, the group analyses were restricted to the intersection of all subjects' whole- 
brain masks, comprising 43 367 voxels (no mask was used to restrict the analysis using 
our Bayesian model selection approach). First, a t-score map was computed from the 
37 individual estimated effect maps. Then, a permutation test was used to compute 
three different cluster- forming thresholds, tuned to control the per-comparison error 
rate (PCER) respectively at 10~^, 10~^ and 10"'* uncorrected. For each threshold, a 
second permutation test was used to determine the critical cluster size to guarantee a 
FWER control of 5% (see Section 2.4.2). Each detected cluster was labeled according 
to the region containing its maximum t-score; this is one of the procedures suggested in 
[Tzourio-Mazoyer et al., 2002]. 

6.1.3 Summarizing the inference 

We chose to summarize the results of each procedure using the following statistics. The 
first one is the posterior estimate of the mean population effect, averaged across all 
possible networks, obtained as 



A = ^Hli\y,^j,w]n{-f\y,w) 



using the notations introduced in Section 5.8. Here we note 7r(7|y, w) using a tilde since 
we use the conservative approximation in Section 5.8.3. 

Secondly, we computed the posterior estimate of regional means, also averaged across 
all possible networks: 

7 

where r)^ = argmax^ 7r(T7|y, w,7). 

Finally, the functional network selected by our algorithm is summed up by the map of 
approximate posterior probabilities Pj that the regions j are involved (as defined by 
(5.33)). For clarity, we have only represented regions j detected as involved {Pj > 0.5), 
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and with a positive regional mean estimate fjj, i.e., regions detected as active for the 
task under study. 

6.2 Number processing task 

Spatial uncertainty 




No spatial uncertainty 



Figure 6.2: Number processing task, results of the model selection approach in axial 
slice z = 37mm in Talairach space, (top) using the model with spatial uncertainty, 
(bottom) with no spatial uncertainty. From left to right: approximate posterior prob- 
ability map Pj, restricted to regions detected as activated {Pj > 0.5, fij > 0); labels of 
detected regions (numbers correspond to region index in Tables 6.1 and 6.2); posterior 
estimate of mean effect map fl (see Section 6.1.3). The first two maps are overlayed on 
the mean anatomical image of all subjects. 



As can be seen from Figure 6.2, right, the estimated mean effect map fi is more contrasted 
using the model with spatial uncertainty than without, and slightly less noisy in the 
background. However, the regularizing effect observed previously (see Section 4.6), in a 
context where /x was estimated by marginalizing out the elementary displacements w, 
is absent here, since /x is estimated conditional on a particular value w. 



Application to real fMRI data 



135 



Sulcus/Fissure 



Vj 



Spatial uncertainty 



Vj 



No spatial uncertainty 



Frontal lobe 



1: Left middle frontal 


1.00 


2.85 


1.00 


2.89 


2: Right middle frontal* 


1.00 


3.47 


1.00 


3.20 


3: Left superior frontal 


0.89 


2.21 


0.92 


2.11 


4: Right superior frontal 


0.96 


1.87 


0.98 


1.85 


5: Left inferior frontal* 


1.00 


4.48 


1.00 


4.04 


6: Left middle precentral 


0.99 


4.53 


1.00 


4.50 


7: Right middle precentral 


0.90 


1.99 


0.38 


1.85 


8: Left inferior precentral 


1.00 


6.08 


1.00 


5.54 


9: Right inferior precentral 


0.92 


2.83 


0.58 


2.54 


10: Left anterior cingular 


1.00 


3.62 


1.00 


3.33 


11: Right anterior cingular 


1.00 


4.26 


1.00 


4.02 



Table 6.1: Number processing task, regions detected in the frontal lobe using the 
Bayesian model selection approach. Reported regions have a posterior probability of 
being involved in the task greater than 0.5, and constitute the most probable func- 
tional network given the data, fij is the posterior estimate of the regional mean effect. 
Asterisks (*) mark regions that are found significant at 5% by the SPM-like approach 
(corrected cluster-level inference, cluster-forming threshold set to FPR — 10~^), and 
correspond to the middle activation map in Figure 6.3. 



The posterior probability and region label maps, illustrated in Figure 6.2 suggest that 
the network detected using both models is very similar, though more regions are detected 
using the model with spatial uncertainty, indicating an increased sensitivity. 



The complete list of regions detected as activated {Pj > 0.5 and r) > 0) is given in Ta- 
bles 6.1 and 6.2. Our method successfully detected the bilateral intra-parietal and fronto- 
cingular networks known to be active during number processing [Chochon et al., 1999, 
Dehaene et al., 2003]. Interestingly, the bilateral precuneus sulci where also detected. 
Although not considered as part of the core numerical system, the precuneus has been 
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Sulcus/Fissure 



Vj 



Spatial uncertainty 



Vj 



No spatial uncertainty 



Parietal Lobe 



12: 


Right transverse parietal 


0.72 


6.67 


0.64 


5.28 


13: 


Left intra-parietal* 


1.00 


5.31 


1.00 


4.89 


14: 


Right intra-parietal 


1.00 


3.50 


1.00 


2.95 


15: 


Left precuneus 


0.99 


6.09 


1.00 


5.81 


16: 


Right precuneus 


0.82 


4.10 


0.88 


3.74 


17: 


Left postcentral intraparietal 


0.53 


2.35 


0.64 


2.14 


18: 


Right postcentral intraparietal 


0.61 


2.07 


0.38 


1.90 


19: 


Left parieto-occipital 


0.68 


1.75 


0.35 


1.77 


20: 


Right parieto-occipital 


0.78 


1.53 


0.00 


1.25 



Other 



21: Right callosal 



0.97 



2.10 



0.59 



1.99 



Table 6.2: Number processing task, regions detected in the parietal lobe using the 
Bayesian model selection approach. Reported regions have a posterior probability of 
being involved in the task greater than 0.5, and constitute the most probable func- 
tional network given the data, fij is the posterior estimate of the regional mean effect. 
Asterisks (*) mark regions that are found significant at 5% by the SPM-like approach 
(corrected cluster-level inference, cluster- forming threshold set to FPR = 10~^), and 
correspond to the middle activation map in Figure 6.3. 



linked to memory access and a wide range of high-level tasks [Cavanna and Trimble, 2006]. 

The networks detected with and without spatial uncertainty are very similar. However, 
4 more regions were detected with spatial uncertainty; in particular, no activations were 
detected in the parieto-occipital region when neglecting spatial uncertainty. This is 
consistent with the fact that the estimated regional means fi are systematically higher 
under spatial uncertainty, and the higher contrast observed on the estimated mean effect 
map. 



In contrast, only three activated clusters were detected by the SPM-like approach at the 
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PCER = 10" 



PCER = 10"^ 



PCER = 10" 



FiGURE 6.3: Clusters detected at different cluster-forming thresholds, for the number 
processing task, in axial slices z = 37mm in Talairach, overlayed on the subjects' 
mean anatomical image in the background) . The threshold is tuned to control the per- 
comparison error rate (PCER) respectively at 10~^, 10~^ and 10~^ uncorrected. Each 
cluster surviving the FWER controlling-threshold at 5% is represented with a specific 

color. 



chosen cluster-forniing threshold. Each cluster contained over a thousand voxels, and 
extended over several atlas regions, hence merging several functionally distinct areas. 
Also, no activations were detected in the right frontal area. Using different thresholds 
could not solve these problems, as illustrated in Figure 6.3. 



6.3 Language processing task 



As in the previous case, a gain in the contrast of the estimated mean effect map p, 
associated with the model under spatial uncertainty can be observed in Figure 6.4, 
right. Apart from this, the regions detected in the displayed slice using both models 
were very similar, with a single additional region detected under spatial uncertainty. 

The network detected by our method, shown in Table 6.3 was consistent with the known 
organization of language processing in the cerebral cortex, as described in [Pinel et al., 2007] 
for instance. This involves the superior temporal sulcus, with a dominance of the left 
hemisphere (which can be linked to the higher estimate of the regional mean observed 
here), and the left frontal areas (such as the precentral sulcus detected here) and the 
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Spatial uncertainty 




No spatial uncertainty 



Figure 6.4: Language processing task, results of the model selection approach, in 
sagittal slice x — — 46mni in Talairach space, (top) using the model with spatial uncer- 
tainty, (bottom) with no spatial uncertainty. From left to right: approximate posterior 
probabilities map Pj, restricted to regions detected as activated {Pj > 0.5, % > 0); 
labels of detected regions (numbers correspond to region index in Table 6.3); posterior 
estimate of mean effect map fi (see Section 6.1.3). The first two maps are shown above 
the mean anatomical image of all subjects. 




PCER = 10- 



PCER = 10 



PCER = lO"'' 



Figure 6.5: Clusters detected at different cluster-forming thresholds, for the language 
processing task, in the sagittal slice x = —46mm in Talairach, overlayed on the subjects' 
mean anatomical image in the background) . The threshold is tuned to control the per- 
comparison error rate (PCER) respectively at 10~^, 10~^ and 10~'* uncorrected. Each 
cluster surviving the FWER controlling-threshold at 5% is represented with a specific 

color. 



supplementary motor area, part of the paracentral area detected here. Finally, activa- 
tions were detected in the collateral fissure, which borders the lingual and the fusiform 
gyrus, both known to be involved in the visual recognition of words. 



Again, there were some minor differences in the networks detected using both models, 
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Sulcus/Fissure 


Pj 


Vj 


Pj 


f}j 




Spatial uncertainty 


No spatial uncertainty 


1: Left middle precentral 


0.05 


2.19 


0.52 


1.43 


2: Left superior precentral 


0.99 


6.62 


0.99 


5.54 


3: Right superior precentral 


0.78 


2.78 


0.00 


2.23 


4: Left inferior precentral 


0.97 


3.64 


0.91 


3.33 


5: Left middle precentral 


0.99 


5.11 


0.99 


4.74 


6: Left paracentral 


0.93 


4.26 


0.86 


3.38 


7: Left posterior sylvian 


1.00 


2.41 


1.00 


2.19 


8: Left anterior sylvian 


0.92 


5.34 


0.72 


4.11 


9: Left superior sylvian 


1.00 


5.64 


0.98 


5.07 


10: Left superior temporal* 


1.00 


7.44 


1.00 


6.99 


11: Right superior temporal* 


1.00 


4.5 


1.00 


3.94 


12: Left central 


1.00 


2.0 


1.00 


1.74 



13: Left anterior collateral 1.00 2.47 0.96 2.39 

14: Left middle collateral 0.56 2.77 0.17 2.51 

15: Left posterior collateral * 0.94 3.48 0.98 3.28 



Table 6.3: Language processing task, regions detected using the Baycsian model 
selection approach. Reported regions have a posterior probability of being involved in 
the task greater than 0.5, and constitute the most probable functional network given 
the data. % is the posterior estimate of the regional mean effect. Asterisks (*) mark 
regions that are found significant at 5% by the SPM-like approach (corrected cluster- 
level inference, cluster-forming threshold set to FPR = 10"'^), and correspond to the 

middle activation map in Figure 6.5. 



and the estimated regional means were generally higher in the model with spatial un- 
certainty. More importantly, the background, which was included in the list of regions, 
was detected as negatively active [Pj = 1, r) = —0.01) in the model without spatial un- 
certainty, suggesting that some activations were projected outside the brain volume due 
to registration errors. This error was corrected when modeling spatially uncertainties. 
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In contrast, the SPM-like approach detected three clusters at the selected threshold (see 
Figure 6.5, middle). One of them (not represented) extended approximately over the 
right superior temporal sulcus, as defined in the CSA. Another one (in red), extended 
over several occipito-temporal parcels. Finally, a large cluster (in orange) containing over 
1 000 voxels encompassed many different atlas regions, suggesting that it contained sev- 
eral functionally distinct areas. As in the case of number processing, this segmentation 
issue was not solved by varying the cluster-forming threshold. 

6.4 Conclusion 

In this chapter, we have validated the Bayesian model selection approach for fMRI group 
data analysis developed in Chapter 5 using a real dataset, by correctly recovering the 
brain functional networks associated to basic number and language processing, according 
to the description found in previous works. Furthermore, we have illustrated certain 
shortcomings of the classical, SPM-like approach, which fails to properly segment distinct 
functional regions, or misses them altogether, depending on the choice of the cluster- 
forming threshold. 

Similar results were obtained using the models with and without spatial uncertainty. A 
probable explanation is the use of an additional penalty on data overfit in both cases, 
which prevented the false detection of inactive regions. However, the posterior mode 
approximation in the model with spatial uncertainty did improve the results, in that it 
allowed to detect additional regions, and enhanced the contrast of the estimated mean 
effect maps. This, together with the higher estimated regional mean effects, suggest that 
the "functional registration step", which consists in displacing the individual images 
according to the most probable displacements a posteriori, provided a better alignment 
of the activated regions of the different subjects. 



Chapter 7 



Conclusion 



7.1 Main Results 

Throughout this thesis, we have developed a new approach for the statistical analysis 
of multi-subject fMRI data, whose aim is to detect the cerebral regions involved in a 
certain cognitive task. Our objective was to address jointly certain limitations of the 
standard SPM-like approach, which are: the dependence on an arbitrary threshold to 
define clusters of potential activity; the lack of control over false negative risks; and the 
assumption that individual images are in perfect alignment. 

In a first contribution, we revisited the adaptive thresholding technique developed in 
[Lavielle and Ludeha, 2007] by removing its dependence on a window parameter, mak- 
ing it more stable at low signal to noise ratios. This method provides an answer to 
the problem of choosing a detection threshold, while implicitly balancing false posi- 
tive and false negative risks, by minimizing a model selection criterion rather than a 
multiple comparison type I error rate. It gave satisfying results when applied to in- 
dividual and group activation maps, compared to Gamma-Gaussian mixture modeling 
[Beckmann et al., 2003b]. This technique may be best adapted to within-subject anal- 
ysis however since it aims to detect individual voxels rather than regions, which are 
ultimately the objects of interest in group analysis. 

The second contribution of this thesis consists in relaxing the assumption of perfect 
match between the estimated effect maps of the different subjects. To this end, we gen- 
eralized the classical mass-univariate (voxelwise) model for fMRI group data analysis 
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by incorporating a set of hidden variables, representing the unknown registration errors, 
modeled as random deformation fields. In contrast, all previous approaches dealing with 
spatial uncertainty in fMRI data were feature-based, meaning that they aimed to match 
high-level features extracted from the individual activation maps. We proposed to esti- 
mate the map of group mean effects in a Bayesian setting by its posterior mean, showed 
the consistency of the joint posterior density of all model parameters, and designed a 
Metropolis-within Gibbs (MH-Gibbs) algorithm [Tierney, 1994] to draw samples from 
it, and compute Monte-Carlo estimates of the mean effect map. 

Our simulation studies evidenced a stretching effect of the estimated activation pattern 
when the registration errors where unaccounted for, which caused neighboring activa- 
tions to be merged. We also showed that this stretching effect could be substantially 
reduced when registration errors were modeled using our approach. When applied to 
real fMRI data, our method yielded estimates of the group effect map under spatial 
uncertainty that were both smoother and more contrasted than under no spatial un- 
certainty, an effect that could not be reproduced by linear isotropic smoothing. These 
encouraging results were obtained in spite of the slow mixing of our MH-Gibbs algo- 
rithm. In particular, the amplitude of the spatial displacements was under-estimated, 
suggesting some space for improvement. 

Finally, we proposed a new paradigm for ROI-based fMRI group data analysis addressing 
jointly the three above-mentioned limitations of the SPM-like approach. Based on a 
Bayesian model selection framework, regions involved in the task under study are selected 
according to the posterior probabilities of a nonzero mean activation, given a pre-defined 
parcellation of the search volume into functionally homogeneous regions. Thus our 
approach is threshold-free, while allowing to incorporate prior information, provided 
that the parcellation is sensible. By controlling a Bayesian risk, our approach balances 
false positive and false negative risks, with weights than can be tuned depending on 
the application domain. Importantly, it is based on the same spatial uncertainty model 
as in Chapter 4, and thus accounts for the mis-alignment of individual images, due to 
inevitable registration errors. As a consequence, for each subject, the membership of a 
voxel to a given region is probabilistic rather than deterministic. This effectively allows 
to de-weight the contribution of activations to a region's mean level of activity, when 
they have been accidently projected into it by mis-registration. 
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This approach requires to evaluate the marginal likelihood of each model, defined as 
a partition of regions into involved and inactive. These marginal likelihoods are not 
available in closed form, but can be evaluated numerically. We chose to use Chib's 
method [Chib, 1995, Chib and Jeliazkov, 2001]. This required the specification of a 
MCMC-SAEM algorithm [Kuhn and Laviehe, 2004], adapted from the MH-Gibbs sam- 
pler above, to derive the MAP estimates of the model parameters, and a simulated 
annealing scheme to obtain the posterior mode of the displacement field density, condi- 
tional on these parameters. 

Results on both simulated and real fMRI data show that the previously evidenced 
stretching effect may cause inactive regions to be contaminated by neighboring acti- 
vations and be detected as active by mistake. This bias toward false positives is reduced 
when modeling spatial uncertainties. However, the marginal likelihood estimate in the 
model with spatial uncertainty turned out to be numerically instable, presumably be- 
cause of the slow mixing observed when sampling the displacement fields. We proposed 
an approximate procedure, which consisted in fixing the displacement fields to their con- 
ditional MAP value, found by the above-mentioned simulated annealing algorithm. This 
approximation proved effective in stabilizing numerically the output of the algorithm, 
but did not entirely remove the bias toward false positives observed when neglecting 
spatial displacements. We compensated this residual bias by including an additional 
penalty on model fit, resulting in a lower bound on the probability of each region being 
involved in the task at hand. This final procedure was validated on both simulated and 
real fMRI datasets. 

7.2 Perspectives 

Many promising directions can be envisioned for future work, some of which have been 
mentioned earlier. To start with, the Metropolis-Hasting algorithm used to sample the 
displacement fields has a high rejection rate and mixes very slowly, as discussed in Ap- 
pendix D, and Sections 4.7, 5.7, and 5.9. This results in the numerical instability of our 
model selection procedure, and needs to be dealt with, for instance by adopting alterna- 
tive proposal densities to the isotropic and stationary Gaussian random distribution we 
use here. Indeed, the posterior density of the displacements is likely to be constrained 
along certain preferential directions, especially in highly contrasted regions of the mean 
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effect map /i, sucli as the interface between an active parcel and an inactive one. Also, 
we expect the posterior density to be more peaked in these regions than in less con- 
trasted areas, such as the background. Generalizing the proposal's covariance structure, 
and allowing it to vary across control points, would therefore seem a reasonable direc- 
tion toward which extending our work. Integrating out analytically x and /x instead of 
simulating them could also promote mixing of the Markov chain and constitute another 
promising line of work. 

Another issue requiring further investigation concerns the additional penalty fit we found 
necessary to include in our approximate procedure. This requires the tuning of a certain 
scale factor c, which we have done using simulated datasets. Though this was found to 
work well on a real-life application, it is necessary to study carefully the stability of the 
optimal value on a wider range of simulated datasets, in order to determine its possible 
dependence on certain critical variables, such as sample size or inter-subject variance. 

In a broader perspective, the framework we propose here for fMRI group inference can 
be extended to address a wide variety of situations. For instance, we have focused on 
the characterization of the activation pattern of a single population of subjects. As 
previously discussed in Section 2.3.2, our model could be extended to compare different 
populations, such as a certain category of patients and healthy subjects, or to corre- 
late fMRI activations with certain interest covariates. This means specifying the mean 
population effect, or, in terms of the regional response model (5.1), the mean regional 
effect, as a linear term in the regressor variables of interest. There is an increasing 
need for such models, particularly in studies combining neuroimaging and genetics, such 
as the IMAGEN project http://www.imagen-europe.com/, which constitute a new and 
promising trend in neurosciences. 

Another exciting prospect, mentioned in Section 5.2, is to extend the model selec- 
tion framework to infer the functional connectivity of the specified parcels. Following 
[Bowman et al., 2008], this would imply modeling the covariance structure of the re- 
gional means. Identification of functional networks in this setting would involve both 
selecting the regions involved in the task at hand, and the pairs of regions functionally 
correlated. The ensuing model selection problem would be much more challenging than 
the one addressed in the present work, the number of possible models involved being 
increased by the number of potential interactions (in contrast, [Bowman et al., 2008] 



Conclusion 145 

simply estimates the covariance matrix of the regional means, without performing any 
further inference). Furthermore, reducing this complexity would no longer be possible, 
even by adopting an approximation such as the one introduced here (see Section 5.8). 

In another direction, our approach can be generalized to probabilistic instead of deter- 
ministic parcellations. This would especially make sense when using anatomical atlases, 
some of which are probabilistic to account for the inter-subject anatomical variability, 
such as the the CSA atlas [Perrot et al., 2008] (in Chapter 6, we used a deterministic 
version of this atlas, based on the most probable label for each voxel). Interestingly, this 
means that the labels defining the membership of each voxel would have a joint prior dis- 
tribution, given by the probabilistic parcellation, and consequently also a posterior distri- 
bution. In other terms, a Bayesian estimate of the parcellation itself would be available, 
informed by the prior parcellation. Thus, our framework could be used to revisit group- 
level parcellation approaches, such as developed in [Flandin, 2004, Thirion et al., 2006c]. 
It is also an alternative answer to the choice of a parcellation, which we discuss in 5.6. 

Finally, the methodology we have developed is well-adapted to the analysis of surface- 
based data (see Section 2.6.3), for the following reasons. First, because cortical struc- 
tures are better matched across subjects using surface-based registration than classi- 
cal volume-based affine registration [Fischl et al., 1999], we may expect better defined 
anatomical ROIs. The superiority of surface-based registration may not be as obvious 
when compared to nonlinear volume-based registration methods, such as those compared 
in [Klein et al., 2009]. However, to our knowledge, no comparison of surface versus vol- 
ume based nonlinear registration has been conducted up to now. 

On the other hand, we have seen that these approaches require projecting the fMRI data 
of each subject on its cortical surface, which cannot be done in a straightforward way, 
making the localization of activations on the cortical surface uncertain. Thus, the spatial 
uncertainty model we have developed to account for registration errors would still be 
necessary to account for these projection uncertainties, even though registration errors 
may be less important. Furthermore, the simulation study in Section 5.8.1 suggests 
a better behavior of our posterior mode approximation on 2D datasets than on 3D 
datasets, making this application to surface data a promising prospect. 



Appendix A 

Elements of multiple testing 
theory 



A.l Generalities on multiple testing 

Consider the problem of testing simultaneously m distinct null hypotheses. In the con- 
text of neuroimaging, there can be one hypothesis per voxel, as defined in Section 2.3.3, 
or one hypothesis per cluster identified above a certain threshold. For all /c = 1, . . . , m, 
the test is based on a certain decision statistic T^. ^^ = indicates that the k-th. null 
hypothesis is true, and "Hk = 1 that it is false. 

The quantities of interest for multiple testing are then summarized in Table A.l, fol- 
lowing [Benjamini and Hochberg, 1995]. We note A^o = {k ■ Ti-k = 0} and Mi = {k : 
Tik = 1} the sets of true null and false null hypotheses, respectively, and their number: 
rriQ = \M.o\,mi = |A^i|, which are unknown. Ai = {l,...,m} = A^oU-^i i^ ^^^ 
number of tested hypotheses. The only observed variables are the number of rejected 
null hypotheses R, and the number of accepted null hypotheses, R — m. The goal of any 
multiple testing procedure is to minimize both the number V of type I errors, or false 
positives, and the number T of type II errors, or false negatives. 
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# accepted # rejected 



# True null hypotheses 


U 


V 


mo 


# False null hypotheses 


T 


S 


nil 


t 


W 


R 


m 



Table A.l: Number of errors in a multiple testing problem. 
Type I error rates 

A common strategy for limiting the number of errors is to maximize the statistical power 
of the tests while controlling a certain type I error rate at a given level a. The error 
rates most often used, as defined in [Ge et al., 2003], are the following: 

• False positive rate (FPR). It is the expectation of the proportion of type I errors 
among all the tests: 

FPR = E{V)/m (A.l) 

• Family-wise error roie(FWER). It is the probability of at least one type I error: 



FWER = P[V>0] 



(A.2) 



False discovery rate (FDR). It is the expectation of the proportion of type I errors 
among all rejected null hypotheses. When no null hypotheses have been rejected, 
this proportion is set to 0, yielding: 



FDR = S 



V 



L/?>0 



(A.3) 



These quantities can be compared through the following inequality [Ge et al., 2003]: 



FPR < FDR < FWER. 



Thus, if the FWER is controlled at a given level a, i.e., if FWER < a, then the other 
rates are also automatically controlled at the same level. FWER is the most stringent 
criterion possible for multiple testing. As such, it is very much used in neuroimaging. 
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where a high level of confidence is required to report a brain region as involved in a certain 
task. Conversely, the FPR involves no correction at all for multiple comparisons, since 
it is controlled at a given level a as soon as the individual hypoteses Tik = are tested 
at level a. 

Exact, weak and strong control 

The error rates defined in the previous section are implicitely defined conditionally 
on the intersection of all null hypotheses: V-Mo — ClkeMo^^'' ~ ^-^' ^^ noted in 
[Ge et al., 2003]. Control on a given error rate conditional on the true intersection 
T-Lmo of ^^11 hypotheses is called exact control. For instance, a multiple comparison 
procedure has exact control on the FWER if it can control the quantity: P[V > OITImo] 
at any given level a. 

However the set A^o is unknown, so the error rates are generally computed, and con- 
trolled, under the global null hypothesis "Hm = PlfceA^i^fc ~ *-*}' referred to as a weak 
control. It is important to note that weak control alone does not in general imply exact 
control. Hence, a multiple testing procedure which has only weak control is in general 
not a valid procedure, since its results hold only under the assumption that all null 
hypotheses are true. 

Finally, strong control means control for every possible choice of A^o- For instance, strong 
control of the FWER means to be able to control the quantity max_A4'cA^ P[V > 0|^x'] 
at any given level a. Strong control implies both weak and exact control, and is therefore 
sufficient to define a valid mutiple comparison procedure. 

Subset pivotality and p-values 

Subset pivotality is a central property in multiple testing. It is used to ensure that a 
multiple comparison procedure having weak control over a certain error rate also has 
strong control. It is usually defined as follows [Westfall and Young, 1993]: 

Definition A.l (Subset Pivotality). The joint distribution of the test statistics 
(Ti, . . . ,Tm) is said to have the subset pivotality condition if for all subset IC C Ai, 
the joint distribution of {T^jkeK. is the same under the global null hypothesis T-Lm = 
PlfceA^i^fc = 0} as under the restriction %)c = Plfce/ci^fc = 0}. 
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An immediate consequence of this definition is that, under subset pivotality, the joint 
distribution of {Tk)keK. is the same under any intersection of nuh hypotheses including 
Tiic- Informally, subset pivotality implies that the test of each given null hypothesis 
Tik = can be done independently of the status of all other hypotheses Tik', which is a 
quite natural requirement. For instance, if each null hypothesis Tik = concerns a data 
vector Dk, if the -Dfc's are independent, and T^ is a function of D^ only, then the subset 
pivotality property is trivially verified. 

Otherwise, there is no general characterization of subset pivotality. We will see that 
most multiple testing procedures that have strong control rely on this property. 

To illustrate the importance of this notion, consider the p-values, which are another 
useful tool in the multiple testing setting. They are traditonally defined as follows: 

Definition A. 2 (p-values). Note (ti, . . . ,im) the observed values of the test statistics 
(Ti, . . . , Tm)- For k = 1, . . . ,m, The p-value pk associated with tk is given by: 

Pk = P[Tk > t^m = 0]. (A.4) 

Thus, pk is the lowest level a at which Ti^ = is rejected on the basis of t^. 

Note that, under subset pivotality, the quantity defined in (A.4) is well-defined, since 
the probability in the right term takes the same value under any intersection of null 
hypotheses including Tik = 0, and in particular under the global null. On the other 
hand, if subset pivotality is not verified, then the probability of T^ > t^. depends not 
only on T-L^, but on the state of potentially all other hypotheses Tik', hence the quantity 
in (A.4) has more than one possible value, and is consequently ill-defined. This is 
an important point, though scarcely mentioned in the multiple testing llitterature. It 
shows that multiple comparison procedures based on p-values often rely implicitely on 
the subset pivotality property. 

Proposition A. 3. Strong control of the maxT procedure 

If the joint distribution of the test statistics {Tk)\<k<d verifies the subset pivotality prop- 
erty (Definition A.l), The maxT procedure, defined in Section 2.4-1, has strong control 
over the family wise error rate, i.e..' 



FWER < P 



maxTfc > uIHm 

keM 



(A.5) 
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Proof. This follows directly from the definitions: 



FWER 



P[V > 0\Hm,] 

P[3k G A^o, Tk > uI-Hmo] 

P 



= P 
< P 



max Tk > u\Hmo 

keMo 

max Tfc > uITIm 
feeXo 

maxTfe > ulV-M 

k&M 



(A.6) 



where the the fourth equality holds under subset pivotality D 

A. 2 Strong control of the maxT and maximum cluster size 
tests 

Based on the definitions in Section 2.4.1, we now show that the test on the maximum 
suprathreshold cluster also has strong control: 

Proposition A. 4. Strong control of the maximum cluster size test If the joint 
distribution of the test statistics {Tk)i<k<d verifies the subset pivotality property, then 
the test on the maximum cluster size has strong control over the family-wise error rate: 



FWER < P[max ^d > iVl^x]- 
CiQM 



Proof. The cluster-level family-wise error rate (2.10) can be expressed as: 



FWER = P[ max ^Q > NlnMo] 
CiCMo 

< P[ max ^Ci n Mo >N\nMo] 
CtCM 



(A.7) 



This last event depends only {Ti^)kGMoj whereas the first one depends on whether Cj C 
A^o for each suprathreshold clusters Cj, an event depending on statistics T^' in all 
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neighboring voxels k' of Cj, some of whom may be outside Mq (a voxel is said to be 
neighboring Cj if it is neighbor to a voxel contained in Cj). 

Hence, we may apply subset pivotality, yielding: 

FWER < P[max SanTWo >iV|^x]- 
C'iCM 

< P[rnaxJQ|^A,]n 

(A.8) 



Appendix B 

Proof of the consistency of the 
random threshold procedure 



Following [Lavielle and Ludena, 2007], we first recall some notations. Set Ui = yi for 
i £ Ik* and Vi = yi for i ^ Ik*; notice that (fj) is a sample from the distribution F^. 
Let (^(i))i<i<fc* and (w(i))i<i<n-fc* be the sequences {\ui\) and (|fj|) in decreasing order. 
Let ri„ be the subset of O where f(i) < a„/2 and ^(jt*) > a;„/2. 

A first lemma in [Lavielle and Ludefia, 2007] shows that P(r2„) — )• 1, i.e., the collections 
(n(j\) and {vu\) are stochastically in order with high probability. The proof can then be 
restricted to il^. 

Now, let Efc(Tfc j) and Q^j be defined as in Equation (3.1). Using Proposition 3.1, we 
have: 



n—k 

mTkj) = i(i+ Y. VO; 

'^kj — , ^J^k,n-k 

^k\-l-k,n-k) 



^k,j,n-'-k,n—k- 
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Also, let Oj = Eo(X(j)) = J21=i ^/^- Equation (3.2) can be shown separately for k > k^ 
and k < k^. Since the two cases are treated similarly, we will restrict ourselves here to 
the case k > k*. On il„, : 



-'fcj ^k,j — ^k.j -tJk,j,nJ'k,n—k 

— \^k,j — ^k^iTkj)) — BkJ^n {Tk,n~k — ^k^{Tk,n-k)) 

+^k*^{Tk,j) - BkJ,n^kf^{Tk,n-k) 
^ ''^k,j + ^k,j- 

Thus Tfc j — Qfc j is decomposed into a random part R^j and a deterministic part S^j- 
Over i7n, Rkj is a function of f(fc), . . . ,V(^n-k*)- Before going further, we now recall the 
following result: 

Let X(i) > • • • > -^(n) be an ordered sequence of independent Exp{l) random variables. 
For I < j < n, let Tj = ^21=1 -^{i)- Introduce for t £ [0, 1] the random process dn{t) = 
Ttj^t] — '^{T[nt]\Tn) ■ Then it is shown in [Lavielle and Ludeha, 2007] that -j=dn{t), as a 
process indexed on t G [0, 1], converges in distribution to a certain zero mean Gaussian 
process A. 

To use this result, let k = \tri\ and j = [sn], for < t < 1 — c and 0<s<l— i* — t, for 
c in [AF3]. Then -^{Tk,j - Qk,j)'i-n„ = ^/=^{T[tn],[sn] - QH,[sn])ln„, as a process 
indexed by (t, s) G (0, 1)^, converges in distribution to the zero-mean Gaussian process: 



l-t* 
1-t 



Alt±l^]-A^'-'' 



1-t* / \l-t 



similarly, ^==i?fc j„E/^.* (Tfc„„fc)ln„ converges in distribution to another zero-mean 
Gaussian process, and so does their sum, J—^ Rj^j1q^. 



On the other hand. 






^^d — 7 ^ (°»+i ~ Qt + Bkj^n{ai+n-k — aj)), (B.l) 
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so that there exists a constant 7 > 0, which depends on c in [AF3], such that for all 
n > 1, k^ < k < n — Kn, we have supx<j<„_fc \Sk,j\ > lik — A;*). Finally we use the 
following inequality: 



Fk^{kn-k^>nun) < P(??fc* > inf m)- 

k-k*>nun 



From Equation (B.l), Sk*j = 0, hence it follows that: 



V^n - k^ rjk* = sup \Rk*,j + Sk*„j\ 

= sup \Rk*^,j\ 

< sup sup \Rk,j\- 
On the other hand, 



\/n — k inf rj^ = inf sup \Rk,j ~^ Skj\ 

> inf sup \Sk,j\ — sup sup \Rk,j\-, 

k-k*^>nun l<j<n-k ' k>k^ l<j<n-k 

SO that we have: 
Fk*jkn-k*^>nun) < F{C sup sup \Rk,j\ > inf sup \Sk,j\) +F{n^n) 

k>k* l<j<n—k ^ k„>nu„ l<j<n—k 

< P(C7sup sup \Rkj\>-fnUn) + F{n^J, 

k>k'f^ l<:j<:n—k 

where C is a constant which depends on c in [AF3]. This last probability vanishes as n 
goes to infinity, due to the weak convergence of Rkjln^O 



Appendix C 

Identifiability in the model with 
spatial uncertainty 



We show here the identifiabihty of the parameters in the hierarchical model introduced 
in Chapter 5, 6 = {r} , u"^ , a"^ , a"^) . Identificability in the model in Chapter 4 requires 
a slightly different proof, since the parameter vector is defined by = (/x, o"^, cr|). We 
have not included it here however, as it uses very similar arguments. 



Theorem C.l. (Identifiability in the Regionalized Model) The full hierarchical 
model specified by (2.5), (4-2), (4-3), (4-4)7 (^-V ^^ identifiable. 

Proof. This result may be re-phrased by saying that, if 6 ^ 0' , then f{y\6) ^ /(y|^') 
in at least one point y. As noted previously in Appendix F, the conditional density 
/(y|w,0) is multivariate Gaussian. It depends on w only through the definition of the 
voxelwise blocks 



4 = S {i,l)i<i<n '^i{l) = k 

I l<l<d 

containing for each k the indices of observations yii displaced to voxel k. The collection 
of all voxelwise blocks I = (Ii, . . . , Id) constitutes a partition of the cartesian product 
{1, . . . , n} X {1, . . . , d} into d subsets. Noting X the set of all possible such partitions, 
we see that the PDF can be re-written as the mixture: 



fiy\e) = Y,^{I\al)fiy\I,rj,u^cT' 



lex 
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where 7r(I|cr|) = /^gj 7r(w|(7|)riw is the probabihty of block configuration I, and /(y|I, t], v^,cr^ 
is multivariate Gaussian, with mean and covariance matrix given by functions of {rj, I) 
and (cr^,I), respectively. Note that these functions depend on w (through I), and also 
that they may not be one-to-one. For instance, the observations may well be all dis- 
placed into a single region jq, in which case the regional parameters (ryj, z^y ,0"^) of all 
other regions have no effect on the data. 

Because any finite family of distinct Gaussian PDFs is linearly independent, and the 
mixing weights 7r(I| cr|) in the above display are always strictly positive, the only way 
we can have /(y|0) = /(y|^') in all data points y is if there exists a permutation 5 of 
X, such that for all I G X and all y eW"^ : 

7r(I|a|) = n{S{I)\a'i) 

/(y|I, 77,1.2, ^2) ^ f(y\6iI),'n',I.'^CT'^). (C.l) 

This the analog of the 'label-switching' phenomenon in classical finite mixture models. 
We now show that such label-switching is impossible, because the partitions I are not 
equally probable. Note Iq the most probable block configuration. 

In = argmax7r(I|(Tq). 
lex '^ 

Because ■w is a zero-mean Gaussian with spherical covariance (4.4), this most probable 
configuration is obtained for w = 0, so that Iq = (/^)i<A:<d) where for all k: I^ = 
{(1, /c), . . . , (n, k)}. Likewise, 5(Io) = loi hence from C.l it follows that for all y : 

/(y|Io,r/,i/2,a2) = /(y|Io,r7',i.'2,a'2), 

which implies that (r/, u^, a^) = {r/', u'"^, cr''^). 

Finally, Iq is a star domain in that if w £ Iq, then the segment {Aw, A G (0, 1)} lies inside 
Iq. Thus 7r(Io|cr|) is a strictly decreasing function of a^, and since 7r(Io|o"|) = 7T{Io\a'g), 
we have (t| = o"^ D 



Appendix D 

Sampling the posterior density 
using a Metropolis within Gibbs 
algorithm 



As described at the beginning of Section 5.5, the approach in [Chib, 1995] assumes 
that a MCMC algorithm has been devised to sample the posterior density of the model 
parameters . As expressed in (5.18), this density is obtained as a marginal of the joint 
posterior density: 

7r(z,6'|y,7) = 7r(x, w, /i, r/, i/, cr^, cr||y,7) (D.l) 

oc /(y|x)7r(x|/x, cr^, w)7r(/i|r/, u'^)TT{w\al)TT{r]\u^)TT{u^ , c^^ o-|). 

We use the Gibbs sampler algorithm [Geman and Geman, 1984] to generate a sequence 
of samples from this joint density by sampling successively each of the six following 
blocks: X, w, /j,, (r/, v), <t^, and (t|, conditionally on all others. The conditional density 
of each block is obtained from the complete density (D.l), by treating all other blocks 
as constants. These conditional densities are available in closed form, except for the 
elementary displacements w, and are given as follows. 
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Hidden effects. For each subject i and each voxel k, such that the displaced voxel <pi{k) 
is in region j, it follows from (4.1) and (4.2) that: 

Population mean effect. According to (4.2 and 5.1), for all voxel k in region j, 

where m^ = n^^ Z^0.(;)=fc Xi,k and S*! = n^^ Z^0i(i)=fc(^j,fe -"ifc)^ are the empirical mean 
and variance of the effects centered on /x^, and n^ = jl{(i, A;); (/>j(/) = A;} the number of 
these. 



Regional parameters. For each region j, (5.1), (5.11) and (5.11) yield: 

r/,>|,7, =0,... ~ 5(0); (D.4) 

^|l--- ~ Xg a + d„/3 + -^A^i-7 /^(^^'^^.^^ |, (D.6) 
where dj = tJ{/c; ik = j} is the size of region j. 



Population variance. For each region j, the posterior density of a? is computed from 



(4.2) and (5.12), resulting in: 



^2 1 

aj\... ~ 



2:a(« + ^5^n,,/3+^^ Y, i^^,l - f^kf] . (D.7) 

V 4=i 4-=j>i(0=fc / 



Spatial variance. The posterior density of the spatial variance a^ is given by (4.3) 
and (5.12): 



2 1 ^^ / , 3nS ^ Ei,b II Wi,6 



all... ~ Xg a + ^,/3+ ^''""^ ' " . (D.^ 
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Elementary displacements. For each subject i and each control point b, according 
to (4.2) and (4.3): 

Wj^fel... ex 7r(wi,6|o-|)7r(xi|/i,<T^,Wi), (D.9) 

where the prior conditional densities of Wj;, and Xj are defined in (4.4) and (4.2). 
This density cannot be computed analytically, let alone directly sampled. Indeed, 
7r(xi|/x, a^, Wj), as a function of Wj, is piecewise constant, due to the discrete interpola- 
tion of the mean population map (see Section 4.2). Instead, we sample each elementary 
displacement Wj^fc using the Metropolis- Hastings algorithm [Hastings, 1970], wherein a 
candidate value w^ ^ is sampled from any proposal density q{w'- ^) (which may depend 
on the other variables), and accepted with probability: 

. /, ^K,6kl)^(xil/^''^^09(wi,b)l ,^^^, 

a = mmM,— , ^n ^ , 5 \-r~r-{(- (D.IO) 

1^ 7r(wi,fc|cr|)7r(xi|/x, o-^ Wj) g(w. J J 

In the above expression, w^ stands for the vector Wj, where Wj (, has been replaced by the 
candidate value w^ ^. This means that our Gibbs sampler is actually a Metropolis within 
Gibbs algorithm (see for instance [Tierney, 1994]). We used a random walk proposal 
(?(w^ fc) = -^i^'i b'l ^i,b-: '^^ly-'-s)' with a proposal variance (y\y/ tuned during the burn-in 
period to obtain an acceptance rate close to 0.1. A higher acceptance rate of 0.25, as 
advocated in [Roberts et al., 1997], proved a bad idea in our case, as it resulted in such 
a low proposal variance c^p^/ that the Markov Chain remained almost unmoving. 



Appendix E 

Maximization of the posterior 
density using a MCMC-SAEM 
algorithm 



We start by recalling basic facts concerning the MCMC-SAEM coupling developed in 
[Kuhn and Lavielle, 2004], as described hereafter. This extension of the SAEM algo- 
rithm allows to find the maximum likelihood, or the maximum a posteriori, in mixture 
models where the conditional distribution of the hidden variables can only be sampled by 
an MCMC algorithm, as is our case. The main prerequisite in [Kuhn and Lavielle, 2004] 
is that the complete likelihood belongs to the curved exponential family. In our case, 
this means that: 

f{y,z\e) = /(y|z)7r(z|0) 

= exp-{v^(0) + (5(y,z);</.(0))}, (E.l) 

where (•; •) denotes the scalar product. Since we consider MAP estimation here, the 
above likelihood must be mutiplied by the prior density. Due to its conditionally conju- 
gate form, its expression is very close to that of the likelihood: 



7r(6>) = exp-{v.(0) + ^s,;0(6>))}. (E.2) 
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Second, it is assumed that the hidden variables z can be simulated using a transition 
kernel 7r0(-|-) whose stationary distribution is the conditional density 7r(z|y,0), for any 
value of 0. Then the A;-th iteration of the MCMC-SAEM algorithm contains the following 
steps: 

• Simulation. Draw a new value z^ from the current one Zfc_i, according to: z^ ~ 

7r0fe_i(zfc|zfc_i); 

• Stochastic Averaging. Update the expected value s of the sufficient statistics 
by computing the weighted sum: s^ = s^~^ + c^ (5(y,z/c) — s^~^) ; 

• Maximization. Update the parameter value 0^-1 by computing: 
Ok = argmine {^{6) + V.C^) + {s^ + s^; 0(0))} . 

The sequence of positive coefficients {ck)k>o must verify: X^^Cfc = oo but Ylk^\ < 
DO. Then, under certain regularity conditions concerning the functions 0(-), S'(-, •), and 
ip{-)^ the sequence {Ok)k>o converges to a local maximum of the posterior density. The 
rationale underlying the SAEM algoritm is that the stochastic averaging step provides 
a Monte-Carlo estimate of the expected complete likelihood E,[f{y,z,0)\y,Ok-i] when 
it cannot be computed analytically, as in the E-step of a standard EM algorithm. 

In our case, the likelihood and prior can be factorized across regions, except for the part 
depending on the spatial parameter a^, so the log-product of the right members in (E.l) 
and (E.2) can conveniently be written as 



log /(y, z, 0) = C(y, z) + i^sia^) + SsMM^^l 



(E.3) 



^iV 



+ E=i i^j{9j) + (s,{zy,4>j{e, 



where the term C(y, z) is independent from 9 and therefore irrelevant for the maximiza- 
tion. 
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and for all j = 1, . . . , A^ : 

V.,(0,) = (a + l)loga|+(a + l + ^4^)log.| + 7,^'^^; (E.5) 
•^^■(^) = ^ ( E ^'^^ E E (^^'' - '"'^)'; 2/3 + ^ ^^l■ 27,- J] /xfc] (E.6) 






Thus, the MCMC-SAEM algorithm starts with initial values zq, ^O) ■Sg, (s'^)i<j<Af and 
iterates the following steps for A; = 1, . . . , ET : 

• Simulation. Update the hidden variables z^ = (xfc,Wfc,/^fc) by simulating them 
conditionnally on the current values (zfc_i, 0,fc_i), as described in Appendix D. 

• Stochastic Averaging. Update the expected values s^'^ '(■^i^ )^<j<N of the suf- 
ficient statistics by computing s| = Sg~ + c^ [Ssi'^k) — Sg~ ) , and for all j : 

S'j = s']'^ + Ck (^Sj{zk) - sf 

• Maximization. Update the parameter value 0^-1 according to: 

(E.8) 
(E.9) 

(E.IO) 
(E.ll) 

To avoid getting trapped in a local maximum far from the global maximum, the stochas- 
tic averaging coefficients are set to c^ = 1 for iterations k = 1, . . . ,Kq, corresponding 
to a 'burn-in' period, during which the algorithm explores the parameter space. Sub- 
sequently, convergence to a local maximum is obtained by choosing CK^+k = ^ for 
k = l,...,K-Ko. 



rr^ 


^S 


^S,k 


*i:2 




a + 1 + sl^ 


^3,k 




11j,k 


d,+A/ 



Appendix F 

Likelihood expression conditional 
on the displacements 



Having estimated using the SAEM algorithm above, we now address the computation 
of the hkehhood for region j, conditional on the elementary displacements w and on the 
indicator variable. This will be used to compute the marginal likelihood approximation 
given by (5.27). The choice of a particular value w* is addressed in the forthcoming 
Appendix G, as it uses the expression for the conditional likelihood derived here. 

From (5.5), it follows that the observations can be separated in conditionally independent 
blocks, depending on the voxel k they are displaced to. Thus, noting y^, = {yn, 4>i{l) = k) 
the vector of observations displaced to voxel /c, sorted e.g. in lexicographic order, we 
have: 



/(yvjw,0,) = J{f{ykW,e,), 



where 



and the covariance matrix I]^ is equal to: 

^k = i^|l'„fclnfc + f^jlnfc + diag(4)i,«; 0,(/)= 
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Consequently, the log-conditional likelihood boils down to the sum for all voxels in 
region j of the explicit quantities: 



log/(yfc|w,0j) 

(F.l) 

= -^ log(27r) - - log lEfcl -^{fk- %lnJ'S^^(yfc - 7?jl„J. 

The determinant and inverse of S^ are easily computed thanks to the matrix determi- 
nant lemma, and the Shermann-Morrison lemma (see [Golub and Van Loan, 1996] for 
instance): 

|A + uu'| = (l + u'A-^u)|A|; (F.2) 

(A + uu')-^ = A-i--— — 3^, (F.3) 

1 + u'A u 

valid for any invertible n x n matrix A and any n x 1 vector u, such that 
u'A^u > -1. 

We apply these results to A = cr'jlnk + diag(s?;)j^;.<^.(;)=fc and u = Ujlnk- Using (F.2), 
we obtain: 

1+-I E ^) n (-K4). (F.4) 

i,l\(t>S)=k i «'^ i,l;<j>i(l)=k 

Next, defining X = y^, — r/jlnj., we apply (F.3), yielding: 

X'(A + uu')-^X = X' f A-i - ^— ^^^^^ X 

1 + u'A ^u 



X'A-'X-H^ . 
l + u'A-^u 



so that 



(yfc-^ilnJ'5];i^(yfc-r/jl„J= Y^ 

■^ ^-^ a- + s • 

i,l;4>i{l)=k 3 I 

(F.5) 



{yu -Vjf 


V,,;^^) = .^l+4y 


-1 + 4 


i,l;<j>i(l)=k J «' 
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Hence the exact value of log f {y j^lw , 6 j) can be computed following (F.l), (F.4) and 
(F.5). 



Appendix G 

Most probable displacement field 
a posteriori by simulated 
annealing 



As explained in Section 5.5, the marginal likelihood is approximated by conditioning on 
a certain value of the elementary displacements w, defined as the most probable value 
given the data and the estimated parameter 9 : 

w = argmax7r(w|y, 0). (G.l) 

w 

The conditional posterior density is proportional to: 

7r(w|y,0) « /(y|w,0)7r(w|4), (G.2) 

given the a priori independence of w from all other variables conditional on a^. Though 
the right member of (G.2) can be computed explicitely, using the formulas derived in 
Section F, its maximization with respect to w is difficult, because of the high number of 
dimensions involved, and of the complexity of the objective function. In particular, it is 
neither differentiable nor convex, thus prohibiting the use of standard gradient methods. 
Instead, we use the Simulated Annealing (SA) algorithm [Kirkpatrick et al., 1983], as 
described below, which is well-adapted to this setting. 
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Given the objective function TT(w\y,0) we wish to maximize, we define for all a > 
the modified density iTai'w) oc n" {w\y , 0) . These densities share the same modes, but 
differ in the contrast of their landscape, which increases with a. At the limit a — )• oo, 
TTa converges weakly to a combination of Dirac mass in its modes. 

The SA algorithm works by simulating successive values (w*)j>i from transition kernels 
Kat{-,-) with stationary distributions vTaj, where at increases progressively, so that the 
resulting Markov chain at first explores many possible states, and then gets attracted 
with more and more strength toward a mode of ^Ta^ . This heuristic can be straightened 
by showing that for any cooling schedule {at)t>i such that at — )• c«, then w* converges 
almost surely to the global maximum 7r(w|y, 6), as t — >• oo [Granville et al., 1994]. 

As in Appendix D, each elementary displacement Wj f, is updated in turn under the 
target density tTq,, using a M-H step, by simulating a candidate w'^ ^ from the random- 
walk proposal 

9a(w-,6|wi,fc) = AA (w- b; Wi,5, a" Vl^ylg) . (G.3) 

The proposal is then accepted with probability 



a = min -^ 1, 



r(y|w^0)AA(w^,;O,4/aI3) | 
/"(y|w, 6) ^f{^i,b■, 0, ay ah) J ' 



where w' is obtained from the current value w, replacing ■Wj.f, by w^ ^, and /°(y|w', 6) 
is computed from (F.l), (F.4) and (F.5) in Section F. 

We used the cooling schedule: at = t~*, where r G (0, 1) is the cooling rate. In practice, 
we found that setting r = 99% and letting the algorithm run for T = 100 iterations gave 
satisfying results. 



Appendix H 

Likelihood under spatial 
uncertainty, by Chib's method 



Following the notations introduced in Section 5.5.2, we start by showing how to compute 
the reduced posterior ordinates 7r(w*j,|w^j^, 0*,y), for i = l,...,n and b = 1,...,B, 
following [Chib and Jeliazkov, 2001]. First, note that the full conditional density of ■w*j, 
is given by: 

7r(w*,| w*_,„ w+^^ e*,y)^ /(y I w*„ w!.,„ w+'■^ 0* )7r(w*„ w*_,„ w+''^|r ), 

where w"^ denotes the collection of blocks outside w_j;, and distinct from Wj^, that is, 
of blocks Wj'f,/ for i' > i, b' > b, and {i',b') ^ {i,b). As described in Appendix G, This 
conditional density can be sampled by the MH algorithm, with proposal 

and acceptance rate: 

a(wi6, w-{,|wli^, w+*^ 0*,y) 

. j^ ^(w^,|w*_,„w+^^0^y) g(w^„w,b|w*_,„w+^^r,y)\ 
— TTiin \ i X y 
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(in fact, given that the proposal density q is symmetric, the ratio to the right in the 
above display simplifies to 1). 

It can be verified by direct calculation that the transition kernel 



= g(w*„ Wifel wl,„ w+^^ 0*, y)a(w*„ Wib|wl,„ w+^^ 0*, y) 
verifies the local reversibility condition: 

7r(w*„ w,,|w*_,„ w+^^ r , y)^(w*,|wl,„ w+^^ 0* , y) 
= 7r(w,6, w*,|w*_.„ w+^^ r , y)7r(w,fe|w*_,„ w+*^ r , y). 



(H.l) 



Multiply both sides of this equation by 7r(w"*"*^|w^^^, 0*,y), and integrate over both 
.^+46 ^j-^j vjiij, to obtain: 



/ 7r(w*„ Wife|w*_.„ w+^^ r , y)7r(w*„ w+^V-^^' 0* ,y)d^+'^d^n, 



Next, express 7r(w*^, w+*^|vif;^j^, 0*,y) as: 
7r(w*Jwl.^,6'*,y)^(w+^V-i6,w*^,6>*,y), and apply (H.l), yielding: 



7r(Wif,|w_,j,0 ,y) 



El [g(wi6, w*,| w*_ .„ w+^^ e*,y)a{wik, w*,|w*_,„ w+*^ r , y)] 



E2[a(w*„Wib|w*_.„w+^^r,y)] 

(H.2) 

where Ei is the expectation with respect to the conditional posterior 7r(wjfe, ■w"'" |w*l^^, 0*, y), 

and E2 with respect to q{w*^,Wib\w*_^f^,w+'\e* ,y)7r{w+'''\w*_.^,w*^,e* ,y). 
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Both integrals can be estimated from the output of reduced multiple-block MH runs, as 
follows: 

1. To estimate the denominator, sample the conditional posterior: 7r(wjf,, w"*" |w^^^, 0* ,y) 
Let (Wj-^ , w+*^'(^))i<g<G' stand for the generated sample. 

2. Next, include w^^^ in the conditioning set, and generate a sample (w+*^'(''))i<j<j 
from the reduced conditional posterior 7r(w+*''|wljjj, w*^^, 0*,y). For each j, draw 
w^^ from the proposal density (7(w*(,,w^^ I'w*!^^,, w"'"* ,0*,y). Then, form the 
sample (w^S w+*'''(-''))i<j<j. 

3. Estimate the conditional posterior density in (H.2) by: 



7r(Wi5|w_ife,0 ,y) 



G-' EU g(wS,^\ w*,|w*_,„ w+'^-(g), r , y)a(w;g\ w*,|w1,„ w+^''(3\e* ,y) 
J-' E/=i "(w*,, w2 Vl.„ w+^M^-), r , y) 

Repeat steps 1 — 3 for i = 1, . . . , n and b = 1, . . . ,B. Finally, estimate the likelihood in 
the log scale as: 

n B 

Iog/(y|r,7)=log/(y|w*,r,7) + log7r(w*|0*)-^j;iog7r(w*,|w*_,„r,y). 

i=l 6=1 



It can be noted that the values (w+*''''''^)i<j<j generated in step 2 are also produced in 
step 1 of the next reduced run. This means that nxB reduced runs are necessary instead 
of 2n X B. In practical application, we found that the successive reduced runs of the 
multiple block MH algorithm required fine tuning; indeed, the variability of the sampled 
transition kernel values 7r(Wj.^ , ■w*^|w^j^,w"'"*'''(^\ 0*,y) increased with the number of 
hidden variables included in the conditioning set 6*. A possible explanation to this is 
that the modes of the reduced conditional distributions were more and more distant 
from the mode w* of the full conditional, where the kernel was evaluated. We found 
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empirically that using a number of iterations for each run inversely proportional to the 
number of sampled blocks worked best. 
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